{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Benchmark: Implement Levenshtein term similarity matrix and fast SCM between corpora ([RaRe-Technologies/gensim PR #2016][#2016])\n",
    "\n",
    " [#2016]: https://github.com/RaRe-Technologies/gensim/pull/2016 (Implement Levenshtein term similarity matrix and fast SCM between corpora - Pull Request #2016)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "d429fedf094e00c4bb5c27589d5befb53b2e4b13\r\n"
     ]
    }
   ],
   "source": [
    "!git rev-parse HEAD"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "from copy import deepcopy\n",
    "from datetime import timedelta\n",
    "from itertools import product\n",
    "import logging\n",
    "from math import floor, ceil, log10\n",
    "import pickle\n",
    "from random import sample, seed, shuffle\n",
    "from time import time\n",
    "\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "from tqdm import tqdm_notebook\n",
    "\n",
    "def tqdm(iterable, total=None, desc=None):\n",
    "    if total is None:\n",
    "        total = len(iterable)\n",
    "    for num_done, element in enumerate(tqdm_notebook(iterable, total=total)):\n",
    "        logger.info(\"%s: %d / %d\", desc, num_done, total)\n",
    "        yield element\n",
    "\n",
    "from gensim.corpora import Dictionary\n",
    "import gensim.downloader as api\n",
    "from gensim.similarities.index import AnnoyIndexer\n",
    "from gensim.similarities import SparseTermSimilarityMatrix\n",
    "from gensim.similarities import UniformTermSimilarityIndex\n",
    "from gensim.similarities import LevenshteinSimilarityIndex\n",
    "from gensim.models import WordEmbeddingSimilarityIndex\n",
    "from gensim.utils import simple_preprocess\n",
    "\n",
    "RANDOM_SEED = 12345\n",
    "\n",
    "logger = logging.getLogger()\n",
    "fhandler = logging.FileHandler(filename='matrix_speed.log', mode='a')\n",
    "formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')\n",
    "fhandler.setFormatter(formatter)\n",
    "logger.addHandler(fhandler)\n",
    "logger.setLevel(logging.INFO)\n",
    "\n",
    "pd.set_option('display.max_rows', None, 'display.max_seq_items', None)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "\"\"\"Repeatedly run a benchmark callable given various configurations and\n",
    "get a list of results.\n",
    "\n",
    "Return a list of results of repeatedly running a benchmark callable.\n",
    "\n",
    "Parameters\n",
    "----------\n",
    "benchmark : callable tuple -> dict\n",
    "    A benchmark callable that accepts a configuration and returns results.\n",
    "configurations : iterable of tuple\n",
    "    An iterable of configurations that are used for calling the benchmark function.\n",
    "results_filename : str\n",
    "    A filename of a file that will be used to persistently store the results using\n",
    "    pickle. If the file exists, then the function will load the stored results\n",
    "    instead of calling the benchmark callable.\n",
    "\n",
    "Returns\n",
    "-------\n",
    "iterable of tuple\n",
    "    The return values of the individual invocations of the benchmark callable.\n",
    "\n",
    "\"\"\"\n",
    "def benchmark_results(benchmark, configurations, results_filename):\n",
    "    try:\n",
    "        with open(results_filename, \"rb\") as file:\n",
    "            results = pickle.load(file)\n",
    "    except IOError:\n",
    "        configurations = list(configurations)\n",
    "        shuffle(configurations)\n",
    "        results = list(tqdm(\n",
    "            (benchmark(configuration) for configuration in configurations),\n",
    "            total=len(configurations), desc=\"benchmark\"))\n",
    "        with open(results_filename, \"wb\") as file:\n",
    "            pickle.dump(results, file)\n",
    "    return results"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Implement Levenshtein term similarity matrix\n",
    "\n",
    "In Gensim PR [#1827][], we added a base implementation of the soft cosine measure (SCM). The base implementation would create term similarity matrices using a single complex procedure. In the Gensim PR [#2016][], we split the procedure into:\n",
    "\n",
    "- **TermSimilarityIndex** builder classes that produce the $k$ most similar terms for a given term $t$ that are distinct from $t$ along with the term similarities, and\n",
    "- the **SparseTermSimilarityMatrix** director class that constructs term similarity matrices and consumes term similarities produced by **TermSimilarityIndex** instances.\n",
    "\n",
    "One of the benefits of this separation is that we can easily measure the speed at which a **TermSimilarityIndex** builder class produces term similarities and compare this speed with the speed at which the **SparseTermSimilarityMatrix** director class consumes term similarities. This allows us to see which of the classes are a bottleneck that slows down the construction of term similarity matrices.\n",
    "\n",
    "In this notebook, we measure all the currently available builder and director classes. For the measurements, we use the [Google News word embeddings][word2vec-google-news-300] distributed with the C implementation of Word2Vec. From the word embeddings, we will derive a dictionary of 2.01M terms.\n",
    "\n",
    " [word2vec-google-news-300]: https://github.com/mmihaltz/word2vec-GoogleNews-vectors (word2vec-GoogleNews-vectors)\n",
    " [#1827]: https://github.com/RaRe-Technologies/gensim/pull/1827 (Implement Soft Cosine Measure - Pull Request #1827)\n",
    " [#2016]: https://github.com/RaRe-Technologies/gensim/pull/2016 (Implement Levenshtein term similarity matrix and fast SCM between corpora - Pull Request #2016)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "full_model = api.load(\"word2vec-google-news-300\")\n",
    "\n",
    "try:\n",
    "    full_dictionary = Dictionary.load(\"matrix_speed.dictionary\")\n",
    "except IOError:\n",
    "    full_dictionary = Dictionary([[term] for term in full_model.vocab.keys()])\n",
    "    full_dictionary.save(\"matrix_speed.dictionary\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Director class benchmark\n",
    "#### SparseTermSimilarityMatrix\n",
    "First, we measure the speed at which the **SparseTermSimilarityMatrix** director class consumes term similarities."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "def benchmark(configuration):\n",
    "    dictionary, nonzero_limit, symmetric, positive_definite, repetition = configuration\n",
    "    index = UniformTermSimilarityIndex(dictionary)\n",
    "    \n",
    "    start_time = time()\n",
    "    matrix = SparseTermSimilarityMatrix(\n",
    "        index, dictionary, nonzero_limit=nonzero_limit, symmetric=symmetric,\n",
    "        positive_definite=positive_definite, dtype=np.float16).matrix\n",
    "    end_time = time()\n",
    "    \n",
    "    duration = end_time - start_time\n",
    "    return {\n",
    "        \"dictionary_size\": len(dictionary),\n",
    "        \"nonzero_limit\": nonzero_limit,\n",
    "        \"matrix_nonzero\": matrix.nnz,\n",
    "        \"repetition\": repetition,\n",
    "        \"symmetric\": symmetric,\n",
    "        \"positive_definite\": positive_definite,\n",
    "        \"duration\": duration, }"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "4aef903a70e24247ad3c889237ed4c48",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "HBox(children=(IntProgress(value=0, max=4), HTML(value='')))"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n"
     ]
    }
   ],
   "source": [
    "dictionary_sizes = [10**k for k in range(3, int(ceil(log10(len(full_dictionary)))))]\n",
    "seed(RANDOM_SEED)\n",
    "dictionaries = []\n",
    "for size in tqdm(dictionary_sizes, desc=\"dictionaries\"):\n",
    "    dictionary = Dictionary([sample(list(full_dictionary.values()), size)])\n",
    "    dictionaries.append(dictionary)\n",
    "dictionaries.append(full_dictionary)\n",
    "nonzero_limits = [1, 10, 100]\n",
    "symmetry = (True, False)\n",
    "positive_definiteness = (True, False)\n",
    "repetitions = range(10)\n",
    "\n",
    "configurations = product(dictionaries, nonzero_limits, symmetry, positive_definiteness, repetitions)\n",
    "results = benchmark_results(benchmark, configurations, \"matrix_speed.director_results\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The following tables show how long it takes to construct a term similarity matrix (the **duration** column), how many nonzero elements there are in the matrix (the **matrix_nonzero** column) and the mean term similarity consumption speed (the **consumption_speed** column) as we vary the dictionary size (the **dictionary_size** column) the maximum number of nonzero elements outside the diagonal in every column of the matrix (the **nonzero_limit** column), the matrix symmetry constraint (the **symmetric** column), and the matrix positive definiteness constraing (the **positive_definite** column). Ten independendent measurements were taken. The top table shows the mean values and the bottom table shows the standard deviations.\n",
    "\n",
    "We can see that the symmetry and positive definiteness constraints severely limit the number of nonzero elements in the resulting matrix. This in turn increases the consumption speed, since we end up throwing away most of the elements that we consume. The effects of the dictionary size on the mean term similarity consumption speed are minor to none."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "df = pd.DataFrame(results)\n",
    "df[\"consumption_speed\"] = df.dictionary_size * df.nonzero_limit / df.duration\n",
    "df = df.groupby([\"dictionary_size\", \"nonzero_limit\", \"symmetric\", \"positive_definite\"])\n",
    "\n",
    "def display(df):\n",
    "    df[\"duration\"] = [timedelta(0, duration) for duration in df[\"duration\"]]\n",
    "    df[\"matrix_nonzero\"] = [int(nonzero) for nonzero in df[\"matrix_nonzero\"]]\n",
    "    df[\"consumption_speed\"] = [\"%.02f Kword pairs / s\" % (speed / 1000) for speed in df[\"consumption_speed\"]]\n",
    "    return df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th>duration</th>\n",
       "      <th>matrix_nonzero</th>\n",
       "      <th>consumption_speed</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>dictionary_size</th>\n",
       "      <th>nonzero_limit</th>\n",
       "      <th>symmetric</th>\n",
       "      <th>positive_definite</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th rowspan=\"12\" valign=\"top\">10000</th>\n",
       "      <th rowspan=\"4\" valign=\"top\">1</th>\n",
       "      <th rowspan=\"2\" valign=\"top\">False</th>\n",
       "      <th>False</th>\n",
       "      <td>00:00:00.435533</td>\n",
       "      <td>20000</td>\n",
       "      <td>22.96 Kword pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>True</th>\n",
       "      <td>00:00:00.492606</td>\n",
       "      <td>20000</td>\n",
       "      <td>20.30 Kword pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th rowspan=\"2\" valign=\"top\">True</th>\n",
       "      <th>False</th>\n",
       "      <td>00:00:00.185563</td>\n",
       "      <td>10002</td>\n",
       "      <td>53.90 Kword pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>True</th>\n",
       "      <td>00:00:00.240471</td>\n",
       "      <td>10002</td>\n",
       "      <td>41.59 Kword pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th rowspan=\"4\" valign=\"top\">10</th>\n",
       "      <th rowspan=\"2\" valign=\"top\">False</th>\n",
       "      <th>False</th>\n",
       "      <td>00:00:02.687836</td>\n",
       "      <td>110000</td>\n",
       "      <td>37.21 Kword pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>True</th>\n",
       "      <td>00:00:00.615492</td>\n",
       "      <td>20000</td>\n",
       "      <td>162.49 Kword pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th rowspan=\"2\" valign=\"top\">True</th>\n",
       "      <th>False</th>\n",
       "      <td>00:00:00.501188</td>\n",
       "      <td>10118</td>\n",
       "      <td>199.53 Kword pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>True</th>\n",
       "      <td>00:00:01.380586</td>\n",
       "      <td>10010</td>\n",
       "      <td>72.44 Kword pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th rowspan=\"4\" valign=\"top\">100</th>\n",
       "      <th rowspan=\"2\" valign=\"top\">False</th>\n",
       "      <th>False</th>\n",
       "      <td>00:00:25.262807</td>\n",
       "      <td>1010000</td>\n",
       "      <td>39.58 Kword pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>True</th>\n",
       "      <td>00:00:01.132524</td>\n",
       "      <td>20000</td>\n",
       "      <td>883.02 Kword pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th rowspan=\"2\" valign=\"top\">True</th>\n",
       "      <th>False</th>\n",
       "      <td>00:00:03.595666</td>\n",
       "      <td>20198</td>\n",
       "      <td>278.13 Kword pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>True</th>\n",
       "      <td>00:00:11.818912</td>\n",
       "      <td>10100</td>\n",
       "      <td>84.61 Kword pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th rowspan=\"12\" valign=\"top\">2010000</th>\n",
       "      <th rowspan=\"4\" valign=\"top\">1</th>\n",
       "      <th rowspan=\"2\" valign=\"top\">False</th>\n",
       "      <th>False</th>\n",
       "      <td>00:01:31.786585</td>\n",
       "      <td>4020000</td>\n",
       "      <td>21.90 Kword pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>True</th>\n",
       "      <td>00:01:40.954580</td>\n",
       "      <td>4020000</td>\n",
       "      <td>19.91 Kword pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th rowspan=\"2\" valign=\"top\">True</th>\n",
       "      <th>False</th>\n",
       "      <td>00:00:39.050064</td>\n",
       "      <td>2010002</td>\n",
       "      <td>51.48 Kword pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>True</th>\n",
       "      <td>00:00:49.238437</td>\n",
       "      <td>2010002</td>\n",
       "      <td>40.82 Kword pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th rowspan=\"4\" valign=\"top\">10</th>\n",
       "      <th rowspan=\"2\" valign=\"top\">False</th>\n",
       "      <th>False</th>\n",
       "      <td>00:09:35.470373</td>\n",
       "      <td>22110000</td>\n",
       "      <td>34.93 Kword pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>True</th>\n",
       "      <td>00:02:02.920334</td>\n",
       "      <td>4020000</td>\n",
       "      <td>163.52 Kword pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th rowspan=\"2\" valign=\"top\">True</th>\n",
       "      <th>False</th>\n",
       "      <td>00:01:39.576693</td>\n",
       "      <td>2010118</td>\n",
       "      <td>201.88 Kword pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>True</th>\n",
       "      <td>00:04:35.646501</td>\n",
       "      <td>2010010</td>\n",
       "      <td>72.92 Kword pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th rowspan=\"4\" valign=\"top\">100</th>\n",
       "      <th rowspan=\"2\" valign=\"top\">False</th>\n",
       "      <th>False</th>\n",
       "      <td>01:42:01.747568</td>\n",
       "      <td>203010000</td>\n",
       "      <td>32.88 Kword pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>True</th>\n",
       "      <td>00:03:36.420778</td>\n",
       "      <td>4020000</td>\n",
       "      <td>928.75 Kword pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th rowspan=\"2\" valign=\"top\">True</th>\n",
       "      <th>False</th>\n",
       "      <td>00:10:58.434060</td>\n",
       "      <td>2020198</td>\n",
       "      <td>305.30 Kword pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>True</th>\n",
       "      <td>00:39:40.319479</td>\n",
       "      <td>2010100</td>\n",
       "      <td>84.44 Kword pairs / s</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                                                 duration  \\\n",
       "dictionary_size nonzero_limit symmetric positive_definite                   \n",
       "10000           1             False     False             00:00:00.435533   \n",
       "                                        True              00:00:00.492606   \n",
       "                              True      False             00:00:00.185563   \n",
       "                                        True              00:00:00.240471   \n",
       "                10            False     False             00:00:02.687836   \n",
       "                                        True              00:00:00.615492   \n",
       "                              True      False             00:00:00.501188   \n",
       "                                        True              00:00:01.380586   \n",
       "                100           False     False             00:00:25.262807   \n",
       "                                        True              00:00:01.132524   \n",
       "                              True      False             00:00:03.595666   \n",
       "                                        True              00:00:11.818912   \n",
       "2010000         1             False     False             00:01:31.786585   \n",
       "                                        True              00:01:40.954580   \n",
       "                              True      False             00:00:39.050064   \n",
       "                                        True              00:00:49.238437   \n",
       "                10            False     False             00:09:35.470373   \n",
       "                                        True              00:02:02.920334   \n",
       "                              True      False             00:01:39.576693   \n",
       "                                        True              00:04:35.646501   \n",
       "                100           False     False             01:42:01.747568   \n",
       "                                        True              00:03:36.420778   \n",
       "                              True      False             00:10:58.434060   \n",
       "                                        True              00:39:40.319479   \n",
       "\n",
       "                                                           matrix_nonzero  \\\n",
       "dictionary_size nonzero_limit symmetric positive_definite                   \n",
       "10000           1             False     False                       20000   \n",
       "                                        True                        20000   \n",
       "                              True      False                       10002   \n",
       "                                        True                        10002   \n",
       "                10            False     False                      110000   \n",
       "                                        True                        20000   \n",
       "                              True      False                       10118   \n",
       "                                        True                        10010   \n",
       "                100           False     False                     1010000   \n",
       "                                        True                        20000   \n",
       "                              True      False                       20198   \n",
       "                                        True                        10100   \n",
       "2010000         1             False     False                     4020000   \n",
       "                                        True                      4020000   \n",
       "                              True      False                     2010002   \n",
       "                                        True                      2010002   \n",
       "                10            False     False                    22110000   \n",
       "                                        True                      4020000   \n",
       "                              True      False                     2010118   \n",
       "                                        True                      2010010   \n",
       "                100           False     False                   203010000   \n",
       "                                        True                      4020000   \n",
       "                              True      False                     2020198   \n",
       "                                        True                      2010100   \n",
       "\n",
       "                                                                consumption_speed  \n",
       "dictionary_size nonzero_limit symmetric positive_definite                          \n",
       "10000           1             False     False               22.96 Kword pairs / s  \n",
       "                                        True                20.30 Kword pairs / s  \n",
       "                              True      False               53.90 Kword pairs / s  \n",
       "                                        True                41.59 Kword pairs / s  \n",
       "                10            False     False               37.21 Kword pairs / s  \n",
       "                                        True               162.49 Kword pairs / s  \n",
       "                              True      False              199.53 Kword pairs / s  \n",
       "                                        True                72.44 Kword pairs / s  \n",
       "                100           False     False               39.58 Kword pairs / s  \n",
       "                                        True               883.02 Kword pairs / s  \n",
       "                              True      False              278.13 Kword pairs / s  \n",
       "                                        True                84.61 Kword pairs / s  \n",
       "2010000         1             False     False               21.90 Kword pairs / s  \n",
       "                                        True                19.91 Kword pairs / s  \n",
       "                              True      False               51.48 Kword pairs / s  \n",
       "                                        True                40.82 Kword pairs / s  \n",
       "                10            False     False               34.93 Kword pairs / s  \n",
       "                                        True               163.52 Kword pairs / s  \n",
       "                              True      False              201.88 Kword pairs / s  \n",
       "                                        True                72.92 Kword pairs / s  \n",
       "                100           False     False               32.88 Kword pairs / s  \n",
       "                                        True               928.75 Kword pairs / s  \n",
       "                              True      False              305.30 Kword pairs / s  \n",
       "                                        True                84.44 Kword pairs / s  "
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "display(df.mean()).loc[\n",
    "    [10000, len(full_dictionary)], :, :].loc[\n",
    "    :, [\"duration\", \"matrix_nonzero\", \"consumption_speed\"]]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th>duration</th>\n",
       "      <th>matrix_nonzero</th>\n",
       "      <th>consumption_speed</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>dictionary_size</th>\n",
       "      <th>nonzero_limit</th>\n",
       "      <th>symmetric</th>\n",
       "      <th>positive_definite</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th rowspan=\"12\" valign=\"top\">10000</th>\n",
       "      <th rowspan=\"4\" valign=\"top\">1</th>\n",
       "      <th rowspan=\"2\" valign=\"top\">False</th>\n",
       "      <th>False</th>\n",
       "      <td>00:00:00.005334</td>\n",
       "      <td>0</td>\n",
       "      <td>0.28 Kword pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>True</th>\n",
       "      <td>00:00:00.004072</td>\n",
       "      <td>0</td>\n",
       "      <td>0.17 Kword pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th rowspan=\"2\" valign=\"top\">True</th>\n",
       "      <th>False</th>\n",
       "      <td>00:00:00.003124</td>\n",
       "      <td>0</td>\n",
       "      <td>0.90 Kword pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>True</th>\n",
       "      <td>00:00:00.001797</td>\n",
       "      <td>0</td>\n",
       "      <td>0.31 Kword pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th rowspan=\"4\" valign=\"top\">10</th>\n",
       "      <th rowspan=\"2\" valign=\"top\">False</th>\n",
       "      <th>False</th>\n",
       "      <td>00:00:00.011986</td>\n",
       "      <td>0</td>\n",
       "      <td>0.17 Kword pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>True</th>\n",
       "      <td>00:00:00.005972</td>\n",
       "      <td>0</td>\n",
       "      <td>1.59 Kword pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th rowspan=\"2\" valign=\"top\">True</th>\n",
       "      <th>False</th>\n",
       "      <td>00:00:00.002869</td>\n",
       "      <td>0</td>\n",
       "      <td>1.15 Kword pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>True</th>\n",
       "      <td>00:00:00.011411</td>\n",
       "      <td>0</td>\n",
       "      <td>0.60 Kword pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th rowspan=\"4\" valign=\"top\">100</th>\n",
       "      <th rowspan=\"2\" valign=\"top\">False</th>\n",
       "      <th>False</th>\n",
       "      <td>00:00:00.111118</td>\n",
       "      <td>0</td>\n",
       "      <td>0.17 Kword pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>True</th>\n",
       "      <td>00:00:00.007611</td>\n",
       "      <td>0</td>\n",
       "      <td>5.94 Kword pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th rowspan=\"2\" valign=\"top\">True</th>\n",
       "      <th>False</th>\n",
       "      <td>00:00:00.030875</td>\n",
       "      <td>0</td>\n",
       "      <td>2.38 Kword pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>True</th>\n",
       "      <td>00:00:00.050198</td>\n",
       "      <td>0</td>\n",
       "      <td>0.36 Kword pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th rowspan=\"12\" valign=\"top\">2010000</th>\n",
       "      <th rowspan=\"4\" valign=\"top\">1</th>\n",
       "      <th rowspan=\"2\" valign=\"top\">False</th>\n",
       "      <th>False</th>\n",
       "      <td>00:00:00.767305</td>\n",
       "      <td>0</td>\n",
       "      <td>0.18 Kword pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>True</th>\n",
       "      <td>00:00:00.172432</td>\n",
       "      <td>0</td>\n",
       "      <td>0.03 Kword pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th rowspan=\"2\" valign=\"top\">True</th>\n",
       "      <th>False</th>\n",
       "      <td>00:00:00.346239</td>\n",
       "      <td>0</td>\n",
       "      <td>0.46 Kword pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>True</th>\n",
       "      <td>00:00:00.177075</td>\n",
       "      <td>0</td>\n",
       "      <td>0.15 Kword pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th rowspan=\"4\" valign=\"top\">10</th>\n",
       "      <th rowspan=\"2\" valign=\"top\">False</th>\n",
       "      <th>False</th>\n",
       "      <td>00:00:05.156655</td>\n",
       "      <td>0</td>\n",
       "      <td>0.31 Kword pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>True</th>\n",
       "      <td>00:00:00.631676</td>\n",
       "      <td>0</td>\n",
       "      <td>0.83 Kword pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th rowspan=\"2\" valign=\"top\">True</th>\n",
       "      <th>False</th>\n",
       "      <td>00:00:01.216067</td>\n",
       "      <td>0</td>\n",
       "      <td>2.41 Kword pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>True</th>\n",
       "      <td>00:00:00.547773</td>\n",
       "      <td>0</td>\n",
       "      <td>0.14 Kword pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th rowspan=\"4\" valign=\"top\">100</th>\n",
       "      <th rowspan=\"2\" valign=\"top\">False</th>\n",
       "      <th>False</th>\n",
       "      <td>00:04:10.371035</td>\n",
       "      <td>0</td>\n",
       "      <td>1.24 Kword pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>True</th>\n",
       "      <td>00:00:00.634416</td>\n",
       "      <td>0</td>\n",
       "      <td>2.73 Kword pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th rowspan=\"2\" valign=\"top\">True</th>\n",
       "      <th>False</th>\n",
       "      <td>00:00:06.586767</td>\n",
       "      <td>0</td>\n",
       "      <td>3.05 Kword pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>True</th>\n",
       "      <td>00:00:09.030932</td>\n",
       "      <td>0</td>\n",
       "      <td>0.32 Kword pairs / s</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                                                 duration  \\\n",
       "dictionary_size nonzero_limit symmetric positive_definite                   \n",
       "10000           1             False     False             00:00:00.005334   \n",
       "                                        True              00:00:00.004072   \n",
       "                              True      False             00:00:00.003124   \n",
       "                                        True              00:00:00.001797   \n",
       "                10            False     False             00:00:00.011986   \n",
       "                                        True              00:00:00.005972   \n",
       "                              True      False             00:00:00.002869   \n",
       "                                        True              00:00:00.011411   \n",
       "                100           False     False             00:00:00.111118   \n",
       "                                        True              00:00:00.007611   \n",
       "                              True      False             00:00:00.030875   \n",
       "                                        True              00:00:00.050198   \n",
       "2010000         1             False     False             00:00:00.767305   \n",
       "                                        True              00:00:00.172432   \n",
       "                              True      False             00:00:00.346239   \n",
       "                                        True              00:00:00.177075   \n",
       "                10            False     False             00:00:05.156655   \n",
       "                                        True              00:00:00.631676   \n",
       "                              True      False             00:00:01.216067   \n",
       "                                        True              00:00:00.547773   \n",
       "                100           False     False             00:04:10.371035   \n",
       "                                        True              00:00:00.634416   \n",
       "                              True      False             00:00:06.586767   \n",
       "                                        True              00:00:09.030932   \n",
       "\n",
       "                                                           matrix_nonzero  \\\n",
       "dictionary_size nonzero_limit symmetric positive_definite                   \n",
       "10000           1             False     False                           0   \n",
       "                                        True                            0   \n",
       "                              True      False                           0   \n",
       "                                        True                            0   \n",
       "                10            False     False                           0   \n",
       "                                        True                            0   \n",
       "                              True      False                           0   \n",
       "                                        True                            0   \n",
       "                100           False     False                           0   \n",
       "                                        True                            0   \n",
       "                              True      False                           0   \n",
       "                                        True                            0   \n",
       "2010000         1             False     False                           0   \n",
       "                                        True                            0   \n",
       "                              True      False                           0   \n",
       "                                        True                            0   \n",
       "                10            False     False                           0   \n",
       "                                        True                            0   \n",
       "                              True      False                           0   \n",
       "                                        True                            0   \n",
       "                100           False     False                           0   \n",
       "                                        True                            0   \n",
       "                              True      False                           0   \n",
       "                                        True                            0   \n",
       "\n",
       "                                                              consumption_speed  \n",
       "dictionary_size nonzero_limit symmetric positive_definite                        \n",
       "10000           1             False     False              0.28 Kword pairs / s  \n",
       "                                        True               0.17 Kword pairs / s  \n",
       "                              True      False              0.90 Kword pairs / s  \n",
       "                                        True               0.31 Kword pairs / s  \n",
       "                10            False     False              0.17 Kword pairs / s  \n",
       "                                        True               1.59 Kword pairs / s  \n",
       "                              True      False              1.15 Kword pairs / s  \n",
       "                                        True               0.60 Kword pairs / s  \n",
       "                100           False     False              0.17 Kword pairs / s  \n",
       "                                        True               5.94 Kword pairs / s  \n",
       "                              True      False              2.38 Kword pairs / s  \n",
       "                                        True               0.36 Kword pairs / s  \n",
       "2010000         1             False     False              0.18 Kword pairs / s  \n",
       "                                        True               0.03 Kword pairs / s  \n",
       "                              True      False              0.46 Kword pairs / s  \n",
       "                                        True               0.15 Kword pairs / s  \n",
       "                10            False     False              0.31 Kword pairs / s  \n",
       "                                        True               0.83 Kword pairs / s  \n",
       "                              True      False              2.41 Kword pairs / s  \n",
       "                                        True               0.14 Kword pairs / s  \n",
       "                100           False     False              1.24 Kword pairs / s  \n",
       "                                        True               2.73 Kword pairs / s  \n",
       "                              True      False              3.05 Kword pairs / s  \n",
       "                                        True               0.32 Kword pairs / s  "
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "display(df.apply(lambda x: (x - x.mean()).std())).loc[\n",
    "    [10000, len(full_dictionary)], :, :].loc[\n",
    "    :, [\"duration\", \"matrix_nonzero\", \"consumption_speed\"]]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Builder class benchmark\n",
    "#### UniformTermSimilarityIndex\n",
    "First, we measure the speed at which the **UniformTermSimilarityIndex** builder class produces term similarities. **UniformTermSimilarityIndex** is a dummy class that just generates a sequence of constants. It produces much more term similarities per second than the **SparseTermSimilarityMatrix** is capable of consuming and its results will serve as an upper limit."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "def benchmark(configuration):\n",
    "    dictionary, nonzero_limit, repetition = configuration\n",
    "    \n",
    "    start_time = time()\n",
    "    index = UniformTermSimilarityIndex(dictionary)\n",
    "    end_time = time()\n",
    "    constructor_duration = end_time - start_time\n",
    "    \n",
    "    start_time = time()\n",
    "    for term in dictionary.values():\n",
    "        for _j, _k in zip(index.most_similar(term, topn=nonzero_limit), range(nonzero_limit)):\n",
    "            pass\n",
    "    end_time = time()\n",
    "    production_duration = end_time - start_time\n",
    "    \n",
    "    return {\n",
    "        \"dictionary_size\": len(dictionary),\n",
    "        \"nonzero_limit\": nonzero_limit,\n",
    "        \"repetition\": repetition,\n",
    "        \"constructor_duration\": constructor_duration,\n",
    "        \"production_duration\": production_duration, }"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [],
   "source": [
    "nonzero_limits = [1, 10, 100, 1000]\n",
    "\n",
    "configurations = product(dictionaries, nonzero_limits, repetitions)\n",
    "results = benchmark_results(benchmark, configurations, \"matrix_speed.builder_results.uniform\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The following tables show how long it takes to retrieve the most similar terms for all terms in a dictionary (the **production_duration** column) and the mean term similarity production speed (the **production_speed** column) as we vary the dictionary size (the **dictionary_size** column), and the maximum number of most similar terms that will be retrieved (the **nonzero_limit** column). Ten independendent measurements were taken. The top table shows the mean values and the bottom table shows the standard deviations.\n",
    "\n",
    "The **production_speed** is proportional to **nonzero_limit**."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [],
   "source": [
    "df = pd.DataFrame(results)\n",
    "df[\"processing_speed\"] = df.dictionary_size ** 2 / df.production_duration\n",
    "df[\"production_speed\"] = df.dictionary_size * df.nonzero_limit / df.production_duration\n",
    "df = df.groupby([\"dictionary_size\", \"nonzero_limit\"])\n",
    "\n",
    "def display(df):\n",
    "    df[\"constructor_duration\"] = [timedelta(0, duration) for duration in df[\"constructor_duration\"]]\n",
    "    df[\"production_duration\"] = [timedelta(0, duration) for duration in df[\"production_duration\"]]\n",
    "    df[\"processing_speed\"] = [\"%.02f Kword pairs / s\" % (speed / 1000) for speed in df[\"processing_speed\"]]\n",
    "    df[\"production_speed\"] = [\"%.02f Kword pairs / s\" % (speed / 1000) for speed in df[\"production_speed\"]]\n",
    "    return df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th>production_duration</th>\n",
       "      <th>production_speed</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>dictionary_size</th>\n",
       "      <th>nonzero_limit</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th rowspan=\"4\" valign=\"top\">1000</th>\n",
       "      <th>1</th>\n",
       "      <td>00:00:00.002973</td>\n",
       "      <td>336.41 Kword pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10</th>\n",
       "      <td>00:00:00.005372</td>\n",
       "      <td>1861.64 Kword pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>100</th>\n",
       "      <td>00:00:00.026752</td>\n",
       "      <td>3738.79 Kword pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1000</th>\n",
       "      <td>00:00:00.290265</td>\n",
       "      <td>3449.16 Kword pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th rowspan=\"4\" valign=\"top\">2010000</th>\n",
       "      <th>1</th>\n",
       "      <td>00:00:06.318446</td>\n",
       "      <td>318.12 Kword pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10</th>\n",
       "      <td>00:00:10.783611</td>\n",
       "      <td>1863.96 Kword pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>100</th>\n",
       "      <td>00:00:53.108644</td>\n",
       "      <td>3785.04 Kword pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1000</th>\n",
       "      <td>00:09:45.103741</td>\n",
       "      <td>3437.36 Kword pairs / s</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                              production_duration         production_speed\n",
       "dictionary_size nonzero_limit                                             \n",
       "1000            1                 00:00:00.002973   336.41 Kword pairs / s\n",
       "                10                00:00:00.005372  1861.64 Kword pairs / s\n",
       "                100               00:00:00.026752  3738.79 Kword pairs / s\n",
       "                1000              00:00:00.290265  3449.16 Kword pairs / s\n",
       "2010000         1                 00:00:06.318446   318.12 Kword pairs / s\n",
       "                10                00:00:10.783611  1863.96 Kword pairs / s\n",
       "                100               00:00:53.108644  3785.04 Kword pairs / s\n",
       "                1000              00:09:45.103741  3437.36 Kword pairs / s"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "display(df.mean()).loc[\n",
    "    [1000, len(full_dictionary)], :, :].loc[\n",
    "    :, [\"production_duration\", \"production_speed\"]]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th>production_duration</th>\n",
       "      <th>production_speed</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>dictionary_size</th>\n",
       "      <th>nonzero_limit</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th rowspan=\"4\" valign=\"top\">1000</th>\n",
       "      <th>1</th>\n",
       "      <td>00:00:00.000017</td>\n",
       "      <td>1.93 Kword pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10</th>\n",
       "      <td>00:00:00.000062</td>\n",
       "      <td>21.50 Kword pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>100</th>\n",
       "      <td>00:00:00.000408</td>\n",
       "      <td>56.66 Kword pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1000</th>\n",
       "      <td>00:00:00.010500</td>\n",
       "      <td>123.82 Kword pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th rowspan=\"4\" valign=\"top\">2010000</th>\n",
       "      <th>1</th>\n",
       "      <td>00:00:00.023495</td>\n",
       "      <td>1.18 Kword pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10</th>\n",
       "      <td>00:00:00.035587</td>\n",
       "      <td>6.16 Kword pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>100</th>\n",
       "      <td>00:00:00.535765</td>\n",
       "      <td>37.76 Kword pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1000</th>\n",
       "      <td>00:00:15.037816</td>\n",
       "      <td>89.56 Kword pairs / s</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                              production_duration        production_speed\n",
       "dictionary_size nonzero_limit                                            \n",
       "1000            1                 00:00:00.000017    1.93 Kword pairs / s\n",
       "                10                00:00:00.000062   21.50 Kword pairs / s\n",
       "                100               00:00:00.000408   56.66 Kword pairs / s\n",
       "                1000              00:00:00.010500  123.82 Kword pairs / s\n",
       "2010000         1                 00:00:00.023495    1.18 Kword pairs / s\n",
       "                10                00:00:00.035587    6.16 Kword pairs / s\n",
       "                100               00:00:00.535765   37.76 Kword pairs / s\n",
       "                1000              00:00:15.037816   89.56 Kword pairs / s"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "display(df.apply(lambda x: (x - x.mean()).std())).loc[\n",
    "    [1000, len(full_dictionary)], :, :].loc[\n",
    "    :, [\"production_duration\", \"production_speed\"]]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### LevenshteinSimilarityIndex\n",
    "Next, we measure the speed at which the **LevenshteinSimilarityIndex** builder class produces term similarities. **LevenshteinSimilarityIndex** is currently just a naïve implementation that produces much fewer term similarities per second than the **SparseTermSimilarityMatrix** class is capable of consuming."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [],
   "source": [
    "def benchmark(configuration):\n",
    "    dictionary, nonzero_limit, query_terms, repetition = configuration\n",
    "    \n",
    "    start_time = time()\n",
    "    index = LevenshteinSimilarityIndex(dictionary)\n",
    "    end_time = time()\n",
    "    constructor_duration = end_time - start_time\n",
    "    \n",
    "    start_time = time()\n",
    "    for term in query_terms:\n",
    "        for _j, _k in zip(index.most_similar(term, topn=nonzero_limit), range(nonzero_limit)):\n",
    "            pass\n",
    "    end_time = time()\n",
    "    production_duration = end_time - start_time\n",
    "    \n",
    "    return {\n",
    "        \"dictionary_size\": len(dictionary),\n",
    "        \"mean_query_term_length\": np.mean([len(term) for term in query_terms]),\n",
    "        \"nonzero_limit\": nonzero_limit,\n",
    "        \"repetition\": repetition,\n",
    "        \"constructor_duration\": constructor_duration,\n",
    "        \"production_duration\": production_duration, }"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [],
   "source": [
    "nonzero_limits = [1, 10, 100]\n",
    "seed(RANDOM_SEED)\n",
    "min_dictionary = sorted((len(dictionary), dictionary) for dictionary in dictionaries)[0][1]\n",
    "query_terms = sample(list(min_dictionary.values()), 10)\n",
    "\n",
    "configurations = product(dictionaries, nonzero_limits, [query_terms], repetitions)\n",
    "results = benchmark_results(benchmark, configurations, \"matrix_speed.builder_results.levenshtein\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The following tables show how long it takes to retrieve the most similar terms for ten randomly sampled terms from a dictionary (the **production_duration** column), the mean term similarity production speed (the **production_speed** column) and the mean term similarity processing speed (the **processing_speed** column) as we vary the dictionary size (the **dictionary_size** column), and the maximum number of most similar terms that will be retrieved (the **nonzero_limit** column). Ten independendent measurements were taken. The top table shows the mean values and the bottom table shows the standard deviations.\n",
    "\n",
    "The **production_speed** is proportional to **nonzero_limit / dictionary_size**. The **processing_speed** is constant."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [],
   "source": [
    "df = pd.DataFrame(results)\n",
    "df[\"processing_speed\"] = df.dictionary_size * len(query_terms) / df.production_duration\n",
    "df[\"production_speed\"] = df.nonzero_limit * len(query_terms) / df.production_duration\n",
    "df = df.groupby([\"dictionary_size\", \"nonzero_limit\"])\n",
    "\n",
    "def display(df):\n",
    "    df[\"constructor_duration\"] = [timedelta(0, duration) for duration in df[\"constructor_duration\"]]\n",
    "    df[\"production_duration\"] = [timedelta(0, duration) for duration in df[\"production_duration\"]]\n",
    "    df[\"processing_speed\"] = [\"%.02f Kword pairs / s\" % (speed / 1000) for speed in df[\"processing_speed\"]]\n",
    "    df[\"production_speed\"] = [\"%.02f word pairs / s\" % speed for speed in df[\"production_speed\"]]\n",
    "    return df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {
    "scrolled": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th>production_duration</th>\n",
       "      <th>production_speed</th>\n",
       "      <th>processing_speed</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>dictionary_size</th>\n",
       "      <th>nonzero_limit</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th rowspan=\"3\" valign=\"top\">1000</th>\n",
       "      <th>1</th>\n",
       "      <td>00:00:00.055994</td>\n",
       "      <td>178.61 word pairs / s</td>\n",
       "      <td>178.61 Kword pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10</th>\n",
       "      <td>00:00:00.056097</td>\n",
       "      <td>1782.70 word pairs / s</td>\n",
       "      <td>178.27 Kword pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>100</th>\n",
       "      <td>00:00:00.056212</td>\n",
       "      <td>17791.65 word pairs / s</td>\n",
       "      <td>177.92 Kword pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th rowspan=\"3\" valign=\"top\">1000000</th>\n",
       "      <th>1</th>\n",
       "      <td>00:01:20.618070</td>\n",
       "      <td>0.12 word pairs / s</td>\n",
       "      <td>124.05 Kword pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10</th>\n",
       "      <td>00:01:20.048238</td>\n",
       "      <td>1.25 word pairs / s</td>\n",
       "      <td>124.92 Kword pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>100</th>\n",
       "      <td>00:01:20.064999</td>\n",
       "      <td>12.49 word pairs / s</td>\n",
       "      <td>124.90 Kword pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th rowspan=\"3\" valign=\"top\">2010000</th>\n",
       "      <th>1</th>\n",
       "      <td>00:02:44.069399</td>\n",
       "      <td>0.06 word pairs / s</td>\n",
       "      <td>122.51 Kword pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10</th>\n",
       "      <td>00:02:43.914601</td>\n",
       "      <td>0.61 word pairs / s</td>\n",
       "      <td>122.63 Kword pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>100</th>\n",
       "      <td>00:02:43.892408</td>\n",
       "      <td>6.10 word pairs / s</td>\n",
       "      <td>122.64 Kword pairs / s</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                              production_duration         production_speed  \\\n",
       "dictionary_size nonzero_limit                                                \n",
       "1000            1                 00:00:00.055994    178.61 word pairs / s   \n",
       "                10                00:00:00.056097   1782.70 word pairs / s   \n",
       "                100               00:00:00.056212  17791.65 word pairs / s   \n",
       "1000000         1                 00:01:20.618070      0.12 word pairs / s   \n",
       "                10                00:01:20.048238      1.25 word pairs / s   \n",
       "                100               00:01:20.064999     12.49 word pairs / s   \n",
       "2010000         1                 00:02:44.069399      0.06 word pairs / s   \n",
       "                10                00:02:43.914601      0.61 word pairs / s   \n",
       "                100               00:02:43.892408      6.10 word pairs / s   \n",
       "\n",
       "                                     processing_speed  \n",
       "dictionary_size nonzero_limit                          \n",
       "1000            1              178.61 Kword pairs / s  \n",
       "                10             178.27 Kword pairs / s  \n",
       "                100            177.92 Kword pairs / s  \n",
       "1000000         1              124.05 Kword pairs / s  \n",
       "                10             124.92 Kword pairs / s  \n",
       "                100            124.90 Kword pairs / s  \n",
       "2010000         1              122.51 Kword pairs / s  \n",
       "                10             122.63 Kword pairs / s  \n",
       "                100            122.64 Kword pairs / s  "
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "display(df.mean()).loc[\n",
    "    [1000, 1000000, len(full_dictionary)], :].loc[\n",
    "    :, [\"production_duration\", \"production_speed\", \"processing_speed\"]]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {
    "scrolled": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th>production_duration</th>\n",
       "      <th>production_speed</th>\n",
       "      <th>processing_speed</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>dictionary_size</th>\n",
       "      <th>nonzero_limit</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th rowspan=\"3\" valign=\"top\">1000</th>\n",
       "      <th>1</th>\n",
       "      <td>00:00:00.000673</td>\n",
       "      <td>2.16 word pairs / s</td>\n",
       "      <td>2.16 Kword pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10</th>\n",
       "      <td>00:00:00.000409</td>\n",
       "      <td>13.06 word pairs / s</td>\n",
       "      <td>1.31 Kword pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>100</th>\n",
       "      <td>00:00:00.000621</td>\n",
       "      <td>196.80 word pairs / s</td>\n",
       "      <td>1.97 Kword pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th rowspan=\"3\" valign=\"top\">1000000</th>\n",
       "      <th>1</th>\n",
       "      <td>00:00:00.810661</td>\n",
       "      <td>0.00 word pairs / s</td>\n",
       "      <td>1.23 Kword pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10</th>\n",
       "      <td>00:00:00.110013</td>\n",
       "      <td>0.00 word pairs / s</td>\n",
       "      <td>0.17 Kword pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>100</th>\n",
       "      <td>00:00:00.164959</td>\n",
       "      <td>0.03 word pairs / s</td>\n",
       "      <td>0.26 Kword pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th rowspan=\"3\" valign=\"top\">2010000</th>\n",
       "      <th>1</th>\n",
       "      <td>00:00:01.159273</td>\n",
       "      <td>0.00 word pairs / s</td>\n",
       "      <td>0.85 Kword pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10</th>\n",
       "      <td>00:00:00.429011</td>\n",
       "      <td>0.00 word pairs / s</td>\n",
       "      <td>0.32 Kword pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>100</th>\n",
       "      <td>00:00:00.433687</td>\n",
       "      <td>0.02 word pairs / s</td>\n",
       "      <td>0.32 Kword pairs / s</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                              production_duration       production_speed  \\\n",
       "dictionary_size nonzero_limit                                              \n",
       "1000            1                 00:00:00.000673    2.16 word pairs / s   \n",
       "                10                00:00:00.000409   13.06 word pairs / s   \n",
       "                100               00:00:00.000621  196.80 word pairs / s   \n",
       "1000000         1                 00:00:00.810661    0.00 word pairs / s   \n",
       "                10                00:00:00.110013    0.00 word pairs / s   \n",
       "                100               00:00:00.164959    0.03 word pairs / s   \n",
       "2010000         1                 00:00:01.159273    0.00 word pairs / s   \n",
       "                10                00:00:00.429011    0.00 word pairs / s   \n",
       "                100               00:00:00.433687    0.02 word pairs / s   \n",
       "\n",
       "                                   processing_speed  \n",
       "dictionary_size nonzero_limit                        \n",
       "1000            1              2.16 Kword pairs / s  \n",
       "                10             1.31 Kword pairs / s  \n",
       "                100            1.97 Kword pairs / s  \n",
       "1000000         1              1.23 Kword pairs / s  \n",
       "                10             0.17 Kword pairs / s  \n",
       "                100            0.26 Kword pairs / s  \n",
       "2010000         1              0.85 Kword pairs / s  \n",
       "                10             0.32 Kword pairs / s  \n",
       "                100            0.32 Kword pairs / s  "
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "display(df.apply(lambda x: (x - x.mean()).std())).loc[\n",
    "    [1000, 1000000, len(full_dictionary)], :].loc[\n",
    "    :, [\"production_duration\", \"production_speed\", \"processing_speed\"]]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### WordEmbeddingSimilarityIndex\n",
    "Lastly, we measure the speed at which the **WordEmbeddingSimilarityIndex** builder class constructs an instance and produces term similarities. Gensim currently supports both exact (slow but precise) nearest neighbor search and approximate nearest neighbor search using [ANNOY][]. We evaluate both options.\n",
    "\n",
    " [ANNOY]: https://github.com/spotify/annoy (Approximate Nearest Neighbors in C++/Python optimized for memory usage and loading/saving to disk)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [],
   "source": [
    "def benchmark(configuration):\n",
    "    (model, dictionary), nonzero_limit, annoy_n_trees, query_terms, repetition = configuration\n",
    "    use_annoy = annoy_n_trees > 0\n",
    "    model.init_sims()\n",
    "    \n",
    "    # Time the construction of the optional ANNOY index and of the similarity index.\n",
    "    start_time = time()\n",
    "    if use_annoy:\n",
    "        annoy = AnnoyIndexer(model, annoy_n_trees)\n",
    "        kwargs = {\"indexer\": annoy}\n",
    "    else:\n",
    "        kwargs = {}\n",
    "    index = WordEmbeddingSimilarityIndex(model, kwargs=kwargs)\n",
    "    end_time = time()\n",
    "    constructor_duration = end_time - start_time\n",
    "    \n",
    "    # Time the retrieval of at most nonzero_limit most similar terms for every query term.\n",
    "    start_time = time()\n",
    "    for term in query_terms:\n",
    "        for _j, _k in zip(index.most_similar(term, topn=nonzero_limit), range(nonzero_limit)):\n",
    "            pass\n",
    "    end_time = time()\n",
    "    production_duration = end_time - start_time\n",
    "    \n",
    "    return {\n",
    "        \"dictionary_size\": len(dictionary),\n",
    "        \"mean_query_term_length\": np.mean([len(term) for term in query_terms]),\n",
    "        \"nonzero_limit\": nonzero_limit,\n",
    "        \"use_annoy\": use_annoy,\n",
    "        \"annoy_n_trees\": annoy_n_trees,\n",
    "        \"repetition\": repetition,\n",
    "        \"constructor_duration\": constructor_duration,\n",
    "        \"production_duration\": production_duration, }"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "842bb1a60f814110a8f20eb44a973397",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "HBox(children=(IntProgress(value=0, max=5), HTML(value='')))"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n"
     ]
    }
   ],
   "source": [
    "models = []\n",
    "for dictionary in tqdm(dictionaries, desc=\"models\"):\n",
    "    if dictionary == full_dictionary:\n",
    "        models.append(full_model)\n",
    "        continue\n",
    "    # Build a reduced model that keeps only the terms of this dictionary.\n",
    "    model = full_model.__class__(full_model.vector_size)\n",
    "    model.vocab = {word: deepcopy(full_model.vocab[word]) for word in dictionary.values()}\n",
    "    model.index2entity = []\n",
    "    vector_indices = []\n",
    "    for index, word in enumerate(full_model.index2entity):\n",
    "        if word in model.vocab.keys():\n",
    "            model.index2entity.append(word)\n",
    "            model.vocab[word].index = len(vector_indices)\n",
    "            vector_indices.append(index)\n",
    "    model.vectors = full_model.vectors[vector_indices]\n",
    "    models.append(model)\n",
    "annoy_n_trees = [0] + [10**k for k in range(3)]\n",
    "seed(RANDOM_SEED)\n",
    "query_terms = sample(list(min_dictionary.values()), 1000)\n",
    "\n",
    "configurations = product(zip(models, dictionaries), nonzero_limits, annoy_n_trees, [query_terms], repetitions)\n",
    "results = benchmark_results(benchmark, configurations, \"matrix_speed.builder_results.wordembeddings\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The following tables show how long it takes to construct an ANNOY index and the builder class instance (the **constructor_duration** column), how long it takes to retrieve the most similar terms for 1,000 randomly sampled terms from a dictionary (the **production_duration** column), the mean term similarity production speed (the **production_speed** column), and the mean term similarity processing speed (the **processing_speed** column) as we vary the dictionary size (the **dictionary_size** column), the maximum number of most similar terms that will be retrieved (the **nonzero_limit** column), and the number of constructed ANNOY trees (the **annoy_n_trees** column). Ten independent measurements were taken. The top table shows the mean values and the bottom table shows the standard deviations.\n",
    "\n",
    "If we do not use ANNOY (**annoy_n_trees**${}=0$), then **production_speed** is proportional to **nonzero_limit / dictionary_size**. \n",
    "If we do use ANNOY (**annoy_n_trees**${}>0$), then **production_speed** is proportional to **nonzero_limit / (annoy_n_trees)**${}^{1/2}$."
   ]
  },
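  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a reading aid, the two speed columns can be written as ratios. The symbols below are shorthand introduced here for illustration only: $n$ is **nonzero_limit**, $|D|$ is **dictionary_size**, $|Q|$ is the number of query terms (1,000 in this benchmark), and $t$ is **production_duration** in seconds.\n",
    "\n",
    "$$v_{\\mathrm{production}} = \\frac{n \\cdot |Q|}{t}, \\qquad v_{\\mathrm{processing}} = \\frac{|D| \\cdot |Q|}{t}$$\n",
    "\n",
    "Under the simplifying assumption that exact search compares every query term against all $|D|$ word vectors, so that $t$ grows roughly in proportion to $|D| \\cdot |Q|$ and depends only weakly on $n$, the first ratio reduces to\n",
    "\n",
    "$$v_{\\mathrm{production}} \\propto \\frac{n}{|D|},$$\n",
    "\n",
    "which is the relationship stated above for **annoy_n_trees**${}=0$. This is a sketch of the reasoning rather than a measured law; the measured durations need not scale exactly linearly with $|D|$."
   ]
  },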
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [],
   "source": [
    "df = pd.DataFrame(results)\n",
    "df[\"processing_speed\"] = df.dictionary_size * len(query_terms) / df.production_duration\n",
    "df[\"production_speed\"] = df.nonzero_limit * len(query_terms) / df.production_duration\n",
    "df = df.groupby([\"dictionary_size\", \"nonzero_limit\", \"annoy_n_trees\"])\n",
    "\n",
    "def display(df):\n",
    "    df[\"constructor_duration\"] = [timedelta(0, duration) for duration in df[\"constructor_duration\"]]\n",
    "    df[\"production_duration\"] = [timedelta(0, duration) for duration in df[\"production_duration\"]]\n",
    "    df[\"processing_speed\"] = [\"%.02f Kword pairs / s\" % (speed / 1000) for speed in df[\"processing_speed\"]]\n",
    "    df[\"production_speed\"] = [\"%.02f Kword pairs / s\" % (speed / 1000) for speed in df[\"production_speed\"]]\n",
    "    return df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th>constructor_duration</th>\n",
       "      <th>production_duration</th>\n",
       "      <th>production_speed</th>\n",
       "      <th>processing_speed</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>dictionary_size</th>\n",
       "      <th>nonzero_limit</th>\n",
       "      <th>annoy_n_trees</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th rowspan=\"6\" valign=\"top\">1000000</th>\n",
       "      <th rowspan=\"3\" valign=\"top\">1</th>\n",
       "      <th>0</th>\n",
       "      <td>00:00:00.000007</td>\n",
       "      <td>00:00:19.962977</td>\n",
       "      <td>0.05 Kword pairs / s</td>\n",
       "      <td>50094.22 Kword pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>00:00:30.268797</td>\n",
       "      <td>00:00:00.097011</td>\n",
       "      <td>10.32 Kword pairs / s</td>\n",
       "      <td>10320061.76 Kword pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>100</th>\n",
       "      <td>00:06:23.415982</td>\n",
       "      <td>00:00:00.160870</td>\n",
       "      <td>6.24 Kword pairs / s</td>\n",
       "      <td>6236688.27 Kword pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th rowspan=\"3\" valign=\"top\">100</th>\n",
       "      <th>0</th>\n",
       "      <td>00:00:00.000008</td>\n",
       "      <td>00:00:22.868372</td>\n",
       "      <td>4.37 Kword pairs / s</td>\n",
       "      <td>43729.34 Kword pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>00:00:31.154876</td>\n",
       "      <td>00:00:00.156238</td>\n",
       "      <td>641.91 Kword pairs / s</td>\n",
       "      <td>6419086.99 Kword pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>100</th>\n",
       "      <td>00:06:23.290572</td>\n",
       "      <td>00:00:01.297445</td>\n",
       "      <td>77.13 Kword pairs / s</td>\n",
       "      <td>771277.71 Kword pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th rowspan=\"6\" valign=\"top\">2010000</th>\n",
       "      <th rowspan=\"3\" valign=\"top\">1</th>\n",
       "      <th>0</th>\n",
       "      <td>00:00:00.000007</td>\n",
       "      <td>00:01:55.303216</td>\n",
       "      <td>0.01 Kword pairs / s</td>\n",
       "      <td>17432.79 Kword pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>00:01:34.004196</td>\n",
       "      <td>00:00:00.190463</td>\n",
       "      <td>5.25 Kword pairs / s</td>\n",
       "      <td>10561607.14 Kword pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>100</th>\n",
       "      <td>00:23:29.796006</td>\n",
       "      <td>00:00:00.339500</td>\n",
       "      <td>2.96 Kword pairs / s</td>\n",
       "      <td>5954865.50 Kword pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th rowspan=\"3\" valign=\"top\">100</th>\n",
       "      <th>0</th>\n",
       "      <td>00:00:00.000007</td>\n",
       "      <td>00:02:11.926861</td>\n",
       "      <td>0.76 Kword pairs / s</td>\n",
       "      <td>15236.46 Kword pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>00:01:35.813414</td>\n",
       "      <td>00:00:00.301120</td>\n",
       "      <td>332.38 Kword pairs / s</td>\n",
       "      <td>6680879.02 Kword pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>100</th>\n",
       "      <td>00:23:05.155399</td>\n",
       "      <td>00:00:03.031527</td>\n",
       "      <td>33.42 Kword pairs / s</td>\n",
       "      <td>671683.05 Kword pairs / s</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                            constructor_duration  \\\n",
       "dictionary_size nonzero_limit annoy_n_trees                        \n",
       "1000000         1             0                  00:00:00.000007   \n",
       "                              1                  00:00:30.268797   \n",
       "                              100                00:06:23.415982   \n",
       "                100           0                  00:00:00.000008   \n",
       "                              1                  00:00:31.154876   \n",
       "                              100                00:06:23.290572   \n",
       "2010000         1             0                  00:00:00.000007   \n",
       "                              1                  00:01:34.004196   \n",
       "                              100                00:23:29.796006   \n",
       "                100           0                  00:00:00.000007   \n",
       "                              1                  00:01:35.813414   \n",
       "                              100                00:23:05.155399   \n",
       "\n",
       "                                            production_duration  \\\n",
       "dictionary_size nonzero_limit annoy_n_trees                       \n",
       "1000000         1             0                 00:00:19.962977   \n",
       "                              1                 00:00:00.097011   \n",
       "                              100               00:00:00.160870   \n",
       "                100           0                 00:00:22.868372   \n",
       "                              1                 00:00:00.156238   \n",
       "                              100               00:00:01.297445   \n",
       "2010000         1             0                 00:01:55.303216   \n",
       "                              1                 00:00:00.190463   \n",
       "                              100               00:00:00.339500   \n",
       "                100           0                 00:02:11.926861   \n",
       "                              1                 00:00:00.301120   \n",
       "                              100               00:00:03.031527   \n",
       "\n",
       "                                                   production_speed  \\\n",
       "dictionary_size nonzero_limit annoy_n_trees                           \n",
       "1000000         1             0                0.05 Kword pairs / s   \n",
       "                              1               10.32 Kword pairs / s   \n",
       "                              100              6.24 Kword pairs / s   \n",
       "                100           0                4.37 Kword pairs / s   \n",
       "                              1              641.91 Kword pairs / s   \n",
       "                              100             77.13 Kword pairs / s   \n",
       "2010000         1             0                0.01 Kword pairs / s   \n",
       "                              1                5.25 Kword pairs / s   \n",
       "                              100              2.96 Kword pairs / s   \n",
       "                100           0                0.76 Kword pairs / s   \n",
       "                              1              332.38 Kword pairs / s   \n",
       "                              100             33.42 Kword pairs / s   \n",
       "\n",
       "                                                        processing_speed  \n",
       "dictionary_size nonzero_limit annoy_n_trees                               \n",
       "1000000         1             0                 50094.22 Kword pairs / s  \n",
       "                              1              10320061.76 Kword pairs / s  \n",
       "                              100             6236688.27 Kword pairs / s  \n",
       "                100           0                 43729.34 Kword pairs / s  \n",
       "                              1               6419086.99 Kword pairs / s  \n",
       "                              100              771277.71 Kword pairs / s  \n",
       "2010000         1             0                 17432.79 Kword pairs / s  \n",
       "                              1              10561607.14 Kword pairs / s  \n",
       "                              100             5954865.50 Kword pairs / s  \n",
       "                100           0                 15236.46 Kword pairs / s  \n",
       "                              1               6680879.02 Kword pairs / s  \n",
       "                              100              671683.05 Kword pairs / s  "
      ]
     },
     "execution_count": 23,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "display(df.mean()).loc[\n",
    "    [1000000, len(full_dictionary)], [1, 100], [0, 1, 100]].loc[\n",
    "    :, [\"constructor_duration\", \"production_duration\", \"production_speed\", \"processing_speed\"]]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th>constructor_duration</th>\n",
       "      <th>production_duration</th>\n",
       "      <th>production_speed</th>\n",
       "      <th>processing_speed</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>dictionary_size</th>\n",
       "      <th>nonzero_limit</th>\n",
       "      <th>annoy_n_trees</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th rowspan=\"6\" valign=\"top\">1000000</th>\n",
       "      <th rowspan=\"3\" valign=\"top\">1</th>\n",
       "      <th>0</th>\n",
       "      <td>00:00:00.000002</td>\n",
       "      <td>00:00:00.115644</td>\n",
       "      <td>0.00 Kword pairs / s</td>\n",
       "      <td>286.27 Kword pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>00:00:01.854097</td>\n",
       "      <td>00:00:00.003517</td>\n",
       "      <td>0.37 Kword pairs / s</td>\n",
       "      <td>367959.55 Kword pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>100</th>\n",
       "      <td>00:00:04.702035</td>\n",
       "      <td>00:00:00.010444</td>\n",
       "      <td>0.35 Kword pairs / s</td>\n",
       "      <td>350506.05 Kword pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th rowspan=\"3\" valign=\"top\">100</th>\n",
       "      <th>0</th>\n",
       "      <td>00:00:00.000002</td>\n",
       "      <td>00:00:00.104872</td>\n",
       "      <td>0.02 Kword pairs / s</td>\n",
       "      <td>198.86 Kword pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>00:00:01.163678</td>\n",
       "      <td>00:00:00.008939</td>\n",
       "      <td>36.14 Kword pairs / s</td>\n",
       "      <td>361441.71 Kword pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>100</th>\n",
       "      <td>00:00:06.818568</td>\n",
       "      <td>00:00:00.036979</td>\n",
       "      <td>2.07 Kword pairs / s</td>\n",
       "      <td>20741.69 Kword pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th rowspan=\"6\" valign=\"top\">2010000</th>\n",
       "      <th rowspan=\"3\" valign=\"top\">1</th>\n",
       "      <th>0</th>\n",
       "      <td>00:00:00.000001</td>\n",
       "      <td>00:00:00.653177</td>\n",
       "      <td>0.00 Kword pairs / s</td>\n",
       "      <td>97.50 Kword pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>00:00:04.677209</td>\n",
       "      <td>00:00:00.005679</td>\n",
       "      <td>0.16 Kword pairs / s</td>\n",
       "      <td>311832.91 Kword pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>100</th>\n",
       "      <td>00:01:38.562684</td>\n",
       "      <td>00:00:00.029887</td>\n",
       "      <td>0.22 Kword pairs / s</td>\n",
       "      <td>434681.25 Kword pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th rowspan=\"3\" valign=\"top\">100</th>\n",
       "      <th>0</th>\n",
       "      <td>00:00:00.000001</td>\n",
       "      <td>00:00:00.979613</td>\n",
       "      <td>0.01 Kword pairs / s</td>\n",
       "      <td>111.85 Kword pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>00:00:03.207474</td>\n",
       "      <td>00:00:00.009479</td>\n",
       "      <td>10.18 Kword pairs / s</td>\n",
       "      <td>204614.80 Kword pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>100</th>\n",
       "      <td>00:00:55.119595</td>\n",
       "      <td>00:00:00.419531</td>\n",
       "      <td>3.46 Kword pairs / s</td>\n",
       "      <td>69543.35 Kword pairs / s</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                            constructor_duration  \\\n",
       "dictionary_size nonzero_limit annoy_n_trees                        \n",
       "1000000         1             0                  00:00:00.000002   \n",
       "                              1                  00:00:01.854097   \n",
       "                              100                00:00:04.702035   \n",
       "                100           0                  00:00:00.000002   \n",
       "                              1                  00:00:01.163678   \n",
       "                              100                00:00:06.818568   \n",
       "2010000         1             0                  00:00:00.000001   \n",
       "                              1                  00:00:04.677209   \n",
       "                              100                00:01:38.562684   \n",
       "                100           0                  00:00:00.000001   \n",
       "                              1                  00:00:03.207474   \n",
       "                              100                00:00:55.119595   \n",
       "\n",
       "                                            production_duration  \\\n",
       "dictionary_size nonzero_limit annoy_n_trees                       \n",
       "1000000         1             0                 00:00:00.115644   \n",
       "                              1                 00:00:00.003517   \n",
       "                              100               00:00:00.010444   \n",
       "                100           0                 00:00:00.104872   \n",
       "                              1                 00:00:00.008939   \n",
       "                              100               00:00:00.036979   \n",
       "2010000         1             0                 00:00:00.653177   \n",
       "                              1                 00:00:00.005679   \n",
       "                              100               00:00:00.029887   \n",
       "                100           0                 00:00:00.979613   \n",
       "                              1                 00:00:00.009479   \n",
       "                              100               00:00:00.419531   \n",
       "\n",
       "                                                  production_speed  \\\n",
       "dictionary_size nonzero_limit annoy_n_trees                          \n",
       "1000000         1             0               0.00 Kword pairs / s   \n",
       "                              1               0.37 Kword pairs / s   \n",
       "                              100             0.35 Kword pairs / s   \n",
       "                100           0               0.02 Kword pairs / s   \n",
       "                              1              36.14 Kword pairs / s   \n",
       "                              100             2.07 Kword pairs / s   \n",
       "2010000         1             0               0.00 Kword pairs / s   \n",
       "                              1               0.16 Kword pairs / s   \n",
       "                              100             0.22 Kword pairs / s   \n",
       "                100           0               0.01 Kword pairs / s   \n",
       "                              1              10.18 Kword pairs / s   \n",
       "                              100             3.46 Kword pairs / s   \n",
       "\n",
       "                                                      processing_speed  \n",
       "dictionary_size nonzero_limit annoy_n_trees                             \n",
       "1000000         1             0                 286.27 Kword pairs / s  \n",
       "                              1              367959.55 Kword pairs / s  \n",
       "                              100            350506.05 Kword pairs / s  \n",
       "                100           0                 198.86 Kword pairs / s  \n",
       "                              1              361441.71 Kword pairs / s  \n",
       "                              100             20741.69 Kword pairs / s  \n",
       "2010000         1             0                  97.50 Kword pairs / s  \n",
       "                              1              311832.91 Kword pairs / s  \n",
       "                              100            434681.25 Kword pairs / s  \n",
       "                100           0                 111.85 Kword pairs / s  \n",
       "                              1              204614.80 Kword pairs / s  \n",
       "                              100             69543.35 Kword pairs / s  "
      ]
     },
     "execution_count": 24,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "display(df.apply(lambda x: (x - x.mean()).std())).loc[\n",
    "    [1000000, len(full_dictionary)], [1, 100], [0, 1, 100]].loc[\n",
    "    :, [\"constructor_duration\", \"production_duration\", \"production_speed\", \"processing_speed\"]]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Implement fast SCM between corpora\n",
    "\n",
    "In Gensim PR [#1827][], we added a base implementation of the soft cosine measure (SCM). The base implementation computed SCM between single documents using the **softcossim** function. In Gensim PR [#2016][], we introduced the **SparseTermSimilarityMatrix.inner_product** method, which computes SCM not only between single documents, but also between a document and a corpus, and between two corpora.\n",
    "\n",
    "For the measurements, we use the [Google News word embeddings][word2vec-google-news-300] distributed with the C implementation of Word2Vec. From the word embeddings, we will derive a dictionary of 2.01M terms. As a corpus, we will use a random sample of 100K articles from the 4.92M English [Wikipedia articles][enwiki].\n",
    "\n",
    " [word2vec-google-news-300]: https://github.com/mmihaltz/word2vec-GoogleNews-vectors (word2vec-GoogleNews-vectors)\n",
    " [enwiki]: https://github.com/RaRe-Technologies/gensim-data/releases/tag/wiki-english-20171001 (wiki-english-20171001)\n",
    " [#1827]: https://github.com/RaRe-Technologies/gensim/pull/1827 (Implement Soft Cosine Measure - Pull Request #1827)\n",
    " [#2016]: https://github.com/RaRe-Technologies/gensim/pull/2016 (Implement Levenshtein term similarity matrix and fast SCM between corpora - Pull Request #2016)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [],
   "source": [
    "full_model = api.load(\"word2vec-google-news-300\")\n",
    "\n",
    "try:\n",
    "    with open(\"matrix_speed.corpus\", \"rb\") as file:\n",
    "        full_corpus = pickle.load(file)\n",
    "except IOError:\n",
    "    original_corpus = list(tqdm(api.load(\"wiki-english-20171001\"), desc=\"original_corpus\", total=4924894))\n",
    "    seed(RANDOM_SEED)\n",
    "    full_corpus = [\n",
    "        simple_preprocess(u'\\n'.join(article[\"section_texts\"]))\n",
    "        for article in tqdm(sample(original_corpus, 10**5), desc=\"full_corpus\", total=10**5)]\n",
    "    del original_corpus\n",
    "    with open(\"matrix_speed.corpus\", \"wb\") as file:\n",
    "        pickle.dump(full_corpus, file)\n",
    "\n",
    "try:\n",
    "    full_dictionary = Dictionary.load(\"matrix_speed.dictionary\")\n",
    "except IOError:\n",
    "    full_dictionary = Dictionary([[term] for term in full_model.vocab.keys()])\n",
    "    full_dictionary.save(\"matrix_speed.dictionary\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### SCM between two documents\n",
    "First, we measure the speed at which the **inner_product** method computes the SCM between single documents."
   ]
  },
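  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For orientation, the quantity being timed can be written out as follows. The notation is introduced here for illustration only: $x$ and $y$ are the bag-of-words vectors of the two documents and $S$ is the term similarity matrix. Without normalization, the inner product is\n",
    "\n",
    "$$\\langle x, y \\rangle_S = x^{\\mathsf{T}} S\\, y,$$\n",
    "\n",
    "and the SCM is, by its usual definition, this inner product normalized by the norms of both documents:\n",
    "\n",
    "$$\\mathrm{SCM}(x, y) = \\frac{x^{\\mathsf{T}} S\\, y}{\\sqrt{x^{\\mathsf{T}} S\\, x}\\,\\sqrt{y^{\\mathsf{T}} S\\, y}}.$$\n",
    "\n",
    "We assume that the **normalized** argument of **inner_product** switches between these two forms."
   ]
  },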
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [],
   "source": [
    "def benchmark(configuration):\n",
    "    (matrix, dictionary, nonzero_limit), corpus, normalized, repetition = configuration\n",
    "    corpus_size = len(corpus)\n",
    "    # Convert documents to bag-of-words vectors and drop those with no in-dictionary terms.\n",
    "    corpus = [dictionary.doc2bow(doc) for doc in corpus]\n",
    "    corpus = [vec for vec in corpus if len(vec) > 0]\n",
    "    \n",
    "    # Time the inner product for every ordered pair of documents.\n",
    "    start_time = time()\n",
    "    for vec1 in corpus:\n",
    "        for vec2 in corpus:\n",
    "            matrix.inner_product(vec1, vec2, normalized=normalized)\n",
    "    end_time = time()\n",
    "    duration = end_time - start_time\n",
    "    \n",
    "    return {\n",
    "        \"dictionary_size\": matrix.matrix.shape[0],\n",
    "        \"matrix_nonzero\": matrix.matrix.nnz,\n",
    "        \"nonzero_limit\": nonzero_limit,\n",
    "        \"normalized\": normalized,\n",
    "        \"corpus_size\": corpus_size,\n",
    "        \"corpus_actual_size\": len(corpus),\n",
    "        \"corpus_nonzero\": sum(len(vec) for vec in corpus),\n",
    "        \"mean_document_length\": np.mean([len(doc) for doc in corpus]),\n",
    "        \"repetition\": repetition,\n",
    "        \"duration\": duration, }"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "110675d5552847819754f0dc5b1c19e1",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "HBox(children=(IntProgress(value=0, max=2), HTML(value='')))"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "744e400d597440f79b5923dafb1974fc",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "HBox(children=(IntProgress(value=0, max=2), HTML(value='')))"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "0f84efc0c79a4628a9543736fc5f0c9a",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "HBox(children=(IntProgress(value=0, max=2), HTML(value='')))"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "8a185a8e530e4481b90056222f5f0a1c",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "HBox(children=(IntProgress(value=0, max=6), HTML(value='')))"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/mnt/storage/home/novotny/.virtualenvs/gensim/lib/python3.4/site-packages/gensim/matutils.py:738: FutureWarning: Conversion of the second argument of issubdtype from `int` to `np.signedinteger` is deprecated. In future, it will be treated as `np.int64 == np.dtype(int).type`.\n",
      "  if np.issubdtype(vec.dtype, np.int):\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n"
     ]
    }
   ],
   "source": [
    "seed(RANDOM_SEED)\n",
    "dictionary_sizes = [1000, 100000]\n",
    "dictionaries = []\n",
    "for size in tqdm(dictionary_sizes, desc=\"dictionaries\"):\n",
    "    dictionary = Dictionary([sample(list(full_dictionary.values()), size)])\n",
    "    dictionaries.append(dictionary)\n",
    "min_dictionary = sorted((len(dictionary), dictionary) for dictionary in dictionaries)[0][1]\n",
    "\n",
    "corpus_sizes = [100, 1000]\n",
    "corpora = []\n",
    "for size in tqdm(corpus_sizes, desc=\"corpora\"):\n",
    "    corpus = sample(full_corpus, size)\n",
    "    corpora.append(corpus)\n",
    "\n",
    "models = []\n",
    "for dictionary in tqdm(dictionaries, desc=\"models\"):\n",
    "    if dictionary == full_dictionary:\n",
    "        models.append(full_model)\n",
    "        continue\n",
    "    model = full_model.__class__(full_model.vector_size)\n",
    "    model.vocab = {word: deepcopy(full_model.vocab[word]) for word in dictionary.values()}\n",
    "    model.index2entity = []\n",
    "    vector_indices = []\n",
    "    for index, word in enumerate(full_model.index2entity):\n",
    "        if word in model.vocab.keys():\n",
    "            model.index2entity.append(word)\n",
    "            model.vocab[word].index = len(vector_indices)\n",
    "            vector_indices.append(index)\n",
    "    model.vectors = full_model.vectors[vector_indices]\n",
    "    models.append(model)\n",
    "\n",
    "nonzero_limits = [1, 10, 100]\n",
    "matrices = []\n",
    "for (model, dictionary), nonzero_limit in tqdm(\n",
    "        list(product(zip(models, dictionaries), nonzero_limits)), desc=\"matrices\"):\n",
    "    annoy = AnnoyIndexer(model, 1)\n",
    "    index = WordEmbeddingSimilarityIndex(model, kwargs={\"indexer\": annoy})\n",
    "    matrix = SparseTermSimilarityMatrix(index, dictionary, nonzero_limit=nonzero_limit)\n",
    "    matrices.append((matrix, dictionary, nonzero_limit))\n",
    "    del annoy\n",
    "\n",
    "normalization = (True, False)\n",
    "repetitions = range(10)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {},
   "outputs": [],
   "source": [
    "configurations = product(matrices, corpora, normalization, repetitions)\n",
    "results = benchmark_results(benchmark, configurations, \"matrix_speed.inner-product_results.doc_doc\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The following tables show how long it takes to compute the **inner_product** method between all pairs of document vectors in a corpus (the **duration** column), how many nonzero elements there are in a corpus matrix (the **corpus_nonzero** column), how many nonzero elements there are in a term similarity matrix (the **matrix_nonzero** column), and the mean speed at which document similarities are produced (the **speed** column) as we vary the dictionary size (the **dictionary_size** column), the size of the corpus (the **corpus_size** column), the maximum number of nonzero elements in a single column of the matrix (the **nonzero_limit** column), and whether the inner product is normalized (the **normalized** column). Ten independent measurements were taken. The top table shows the mean values and the bottom table shows the standard deviations.\n",
    "\n",
    "The **speed** is inversely proportional to the square of the number of unique terms shared by the two document vectors. In our scenario as well as the standard IR scenario, this means **speed** is constant. Computing a normalized inner product (**normalized**${}={}$True) results in a constant speed decrease."
   ]
  },
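  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The **speed** column is derived from the measurements rather than measured directly. Writing $n$ for the number of non-empty document vectors (**corpus_actual_size**) and $t$ for **duration** (both symbols are introduced here for illustration), the benchmark loops over all $n^2$ ordered pairs of vectors, including each vector paired with itself, so that\n",
    "\n",
    "$$\\mathrm{speed} = \\frac{n^2}{t}.$$"
   ]
  },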
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {},
   "outputs": [],
   "source": [
    "df = pd.DataFrame(results)\n",
    "df[\"speed\"] = df.corpus_actual_size**2 / df.duration\n",
    "del df[\"corpus_actual_size\"]\n",
    "df = df.groupby([\"dictionary_size\", \"corpus_size\", \"nonzero_limit\", \"normalized\"])\n",
    "\n",
    "def display(df):\n",
    "    df[\"duration\"] = [timedelta(0, duration) for duration in df[\"duration\"]]\n",
    "    df[\"speed\"] = [\"%.02f Kdoc pairs / s\" % (speed / 1000) for speed in df[\"speed\"]]\n",
    "    return df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th>duration</th>\n",
       "      <th>corpus_nonzero</th>\n",
       "      <th>matrix_nonzero</th>\n",
       "      <th>speed</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>dictionary_size</th>\n",
       "      <th>corpus_size</th>\n",
       "      <th>nonzero_limit</th>\n",
       "      <th>normalized</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th rowspan=\"8\" valign=\"top\">1000</th>\n",
       "      <th rowspan=\"4\" valign=\"top\">100</th>\n",
       "      <th rowspan=\"2\" valign=\"top\">1</th>\n",
       "      <th>False</th>\n",
       "      <td>00:00:00.007383</td>\n",
       "      <td>3.0</td>\n",
       "      <td>1000.0</td>\n",
       "      <td>1.23 Kdoc pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>True</th>\n",
       "      <td>00:00:00.009028</td>\n",
       "      <td>3.0</td>\n",
       "      <td>1000.0</td>\n",
       "      <td>1.01 Kdoc pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th rowspan=\"2\" valign=\"top\">100</th>\n",
       "      <th>False</th>\n",
       "      <td>00:00:00.007657</td>\n",
       "      <td>3.0</td>\n",
       "      <td>84944.0</td>\n",
       "      <td>1.19 Kdoc pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>True</th>\n",
       "      <td>00:00:00.008238</td>\n",
       "      <td>3.0</td>\n",
       "      <td>84944.0</td>\n",
       "      <td>1.10 Kdoc pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th rowspan=\"4\" valign=\"top\">1000</th>\n",
       "      <th rowspan=\"2\" valign=\"top\">1</th>\n",
       "      <th>False</th>\n",
       "      <td>00:00:00.414364</td>\n",
       "      <td>26.0</td>\n",
       "      <td>1000.0</td>\n",
       "      <td>1.39 Kdoc pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>True</th>\n",
       "      <td>00:00:00.473789</td>\n",
       "      <td>26.0</td>\n",
       "      <td>1000.0</td>\n",
       "      <td>1.22 Kdoc pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th rowspan=\"2\" valign=\"top\">100</th>\n",
       "      <th>False</th>\n",
       "      <td>00:00:00.430833</td>\n",
       "      <td>26.0</td>\n",
       "      <td>84944.0</td>\n",
       "      <td>1.35 Kdoc pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>True</th>\n",
       "      <td>00:00:00.453477</td>\n",
       "      <td>26.0</td>\n",
       "      <td>84944.0</td>\n",
       "      <td>1.27 Kdoc pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th rowspan=\"8\" valign=\"top\">100000</th>\n",
       "      <th rowspan=\"4\" valign=\"top\">100</th>\n",
       "      <th rowspan=\"2\" valign=\"top\">1</th>\n",
       "      <th>False</th>\n",
       "      <td>00:00:05.236376</td>\n",
       "      <td>423.0</td>\n",
       "      <td>101868.0</td>\n",
       "      <td>1.29 Kdoc pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>True</th>\n",
       "      <td>00:00:05.623463</td>\n",
       "      <td>423.0</td>\n",
       "      <td>101868.0</td>\n",
       "      <td>1.20 Kdoc pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th rowspan=\"2\" valign=\"top\">100</th>\n",
       "      <th>False</th>\n",
       "      <td>00:00:05.083829</td>\n",
       "      <td>423.0</td>\n",
       "      <td>8202884.0</td>\n",
       "      <td>1.33 Kdoc pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>True</th>\n",
       "      <td>00:00:05.576003</td>\n",
       "      <td>423.0</td>\n",
       "      <td>8202884.0</td>\n",
       "      <td>1.21 Kdoc pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th rowspan=\"4\" valign=\"top\">1000</th>\n",
       "      <th rowspan=\"2\" valign=\"top\">1</th>\n",
       "      <th>False</th>\n",
       "      <td>00:08:59.285347</td>\n",
       "      <td>5162.0</td>\n",
       "      <td>101868.0</td>\n",
       "      <td>1.26 Kdoc pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>True</th>\n",
       "      <td>00:09:57.693219</td>\n",
       "      <td>5162.0</td>\n",
       "      <td>101868.0</td>\n",
       "      <td>1.14 Kdoc pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th rowspan=\"2\" valign=\"top\">100</th>\n",
       "      <th>False</th>\n",
       "      <td>00:09:23.213450</td>\n",
       "      <td>5162.0</td>\n",
       "      <td>8202884.0</td>\n",
       "      <td>1.21 Kdoc pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>True</th>\n",
       "      <td>00:10:10.612458</td>\n",
       "      <td>5162.0</td>\n",
       "      <td>8202884.0</td>\n",
       "      <td>1.12 Kdoc pairs / s</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                                            duration  \\\n",
       "dictionary_size corpus_size nonzero_limit normalized                   \n",
       "1000            100         1             False      00:00:00.007383   \n",
       "                                          True       00:00:00.009028   \n",
       "                            100           False      00:00:00.007657   \n",
       "                                          True       00:00:00.008238   \n",
       "                1000        1             False      00:00:00.414364   \n",
       "                                          True       00:00:00.473789   \n",
       "                            100           False      00:00:00.430833   \n",
       "                                          True       00:00:00.453477   \n",
       "100000          100         1             False      00:00:05.236376   \n",
       "                                          True       00:00:05.623463   \n",
       "                            100           False      00:00:05.083829   \n",
       "                                          True       00:00:05.576003   \n",
       "                1000        1             False      00:08:59.285347   \n",
       "                                          True       00:09:57.693219   \n",
       "                            100           False      00:09:23.213450   \n",
       "                                          True       00:10:10.612458   \n",
       "\n",
       "                                                      corpus_nonzero  \\\n",
       "dictionary_size corpus_size nonzero_limit normalized                   \n",
       "1000            100         1             False                  3.0   \n",
       "                                          True                   3.0   \n",
       "                            100           False                  3.0   \n",
       "                                          True                   3.0   \n",
       "                1000        1             False                 26.0   \n",
       "                                          True                  26.0   \n",
       "                            100           False                 26.0   \n",
       "                                          True                  26.0   \n",
       "100000          100         1             False                423.0   \n",
       "                                          True                 423.0   \n",
       "                            100           False                423.0   \n",
       "                                          True                 423.0   \n",
       "                1000        1             False               5162.0   \n",
       "                                          True                5162.0   \n",
       "                            100           False               5162.0   \n",
       "                                          True                5162.0   \n",
       "\n",
       "                                                      matrix_nonzero  \\\n",
       "dictionary_size corpus_size nonzero_limit normalized                   \n",
       "1000            100         1             False               1000.0   \n",
       "                                          True                1000.0   \n",
       "                            100           False              84944.0   \n",
       "                                          True               84944.0   \n",
       "                1000        1             False               1000.0   \n",
       "                                          True                1000.0   \n",
       "                            100           False              84944.0   \n",
       "                                          True               84944.0   \n",
       "100000          100         1             False             101868.0   \n",
       "                                          True              101868.0   \n",
       "                            100           False            8202884.0   \n",
       "                                          True             8202884.0   \n",
       "                1000        1             False             101868.0   \n",
       "                                          True              101868.0   \n",
       "                            100           False            8202884.0   \n",
       "                                          True             8202884.0   \n",
       "\n",
       "                                                                    speed  \n",
       "dictionary_size corpus_size nonzero_limit normalized                       \n",
       "1000            100         1             False       1.23 Kdoc pairs / s  \n",
       "                                          True        1.01 Kdoc pairs / s  \n",
       "                            100           False       1.19 Kdoc pairs / s  \n",
       "                                          True        1.10 Kdoc pairs / s  \n",
       "                1000        1             False       1.39 Kdoc pairs / s  \n",
       "                                          True        1.22 Kdoc pairs / s  \n",
       "                            100           False       1.35 Kdoc pairs / s  \n",
       "                                          True        1.27 Kdoc pairs / s  \n",
       "100000          100         1             False       1.29 Kdoc pairs / s  \n",
       "                                          True        1.20 Kdoc pairs / s  \n",
       "                            100           False       1.33 Kdoc pairs / s  \n",
       "                                          True        1.21 Kdoc pairs / s  \n",
       "                1000        1             False       1.26 Kdoc pairs / s  \n",
       "                                          True        1.14 Kdoc pairs / s  \n",
       "                            100           False       1.21 Kdoc pairs / s  \n",
       "                                          True        1.12 Kdoc pairs / s  "
      ]
     },
     "execution_count": 30,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "display(df.mean()).loc[\n",
    "    [1000, 100000], :, [1, 100], :].loc[\n",
    "    :, [\"duration\", \"corpus_nonzero\", \"matrix_nonzero\", \"speed\"]]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th>duration</th>\n",
       "      <th>corpus_nonzero</th>\n",
       "      <th>matrix_nonzero</th>\n",
       "      <th>speed</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>dictionary_size</th>\n",
       "      <th>corpus_size</th>\n",
       "      <th>nonzero_limit</th>\n",
       "      <th>normalized</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th rowspan=\"8\" valign=\"top\">1000</th>\n",
       "      <th rowspan=\"4\" valign=\"top\">100</th>\n",
       "      <th rowspan=\"2\" valign=\"top\">1</th>\n",
       "      <th>False</th>\n",
       "      <td>00:00:00.000871</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.13 Kdoc pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>True</th>\n",
       "      <td>00:00:00.001315</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.14 Kdoc pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th rowspan=\"2\" valign=\"top\">100</th>\n",
       "      <th>False</th>\n",
       "      <td>00:00:00.000893</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.12 Kdoc pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>True</th>\n",
       "      <td>00:00:00.000631</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.08 Kdoc pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th rowspan=\"4\" valign=\"top\">1000</th>\n",
       "      <th rowspan=\"2\" valign=\"top\">1</th>\n",
       "      <th>False</th>\n",
       "      <td>00:00:00.014460</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.05 Kdoc pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>True</th>\n",
       "      <td>00:00:00.025250</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.07 Kdoc pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th rowspan=\"2\" valign=\"top\">100</th>\n",
       "      <th>False</th>\n",
       "      <td>00:00:00.039088</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.11 Kdoc pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>True</th>\n",
       "      <td>00:00:00.023602</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.06 Kdoc pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th rowspan=\"8\" valign=\"top\">100000</th>\n",
       "      <th rowspan=\"4\" valign=\"top\">100</th>\n",
       "      <th rowspan=\"2\" valign=\"top\">1</th>\n",
       "      <th>False</th>\n",
       "      <td>00:00:00.276359</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.07 Kdoc pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>True</th>\n",
       "      <td>00:00:00.278806</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.06 Kdoc pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th rowspan=\"2\" valign=\"top\">100</th>\n",
       "      <th>False</th>\n",
       "      <td>00:00:00.286781</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.07 Kdoc pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>True</th>\n",
       "      <td>00:00:00.313397</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.06 Kdoc pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th rowspan=\"4\" valign=\"top\">1000</th>\n",
       "      <th rowspan=\"2\" valign=\"top\">1</th>\n",
       "      <th>False</th>\n",
       "      <td>00:00:14.321101</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.03 Kdoc pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>True</th>\n",
       "      <td>00:00:23.526104</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.05 Kdoc pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th rowspan=\"2\" valign=\"top\">100</th>\n",
       "      <th>False</th>\n",
       "      <td>00:00:05.899527</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.01 Kdoc pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>True</th>\n",
       "      <td>00:00:24.454422</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.05 Kdoc pairs / s</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                                            duration  \\\n",
       "dictionary_size corpus_size nonzero_limit normalized                   \n",
       "1000            100         1             False      00:00:00.000871   \n",
       "                                          True       00:00:00.001315   \n",
       "                            100           False      00:00:00.000893   \n",
       "                                          True       00:00:00.000631   \n",
       "                1000        1             False      00:00:00.014460   \n",
       "                                          True       00:00:00.025250   \n",
       "                            100           False      00:00:00.039088   \n",
       "                                          True       00:00:00.023602   \n",
       "100000          100         1             False      00:00:00.276359   \n",
       "                                          True       00:00:00.278806   \n",
       "                            100           False      00:00:00.286781   \n",
       "                                          True       00:00:00.313397   \n",
       "                1000        1             False      00:00:14.321101   \n",
       "                                          True       00:00:23.526104   \n",
       "                            100           False      00:00:05.899527   \n",
       "                                          True       00:00:24.454422   \n",
       "\n",
       "                                                      corpus_nonzero  \\\n",
       "dictionary_size corpus_size nonzero_limit normalized                   \n",
       "1000            100         1             False                  0.0   \n",
       "                                          True                   0.0   \n",
       "                            100           False                  0.0   \n",
       "                                          True                   0.0   \n",
       "                1000        1             False                  0.0   \n",
       "                                          True                   0.0   \n",
       "                            100           False                  0.0   \n",
       "                                          True                   0.0   \n",
       "100000          100         1             False                  0.0   \n",
       "                                          True                   0.0   \n",
       "                            100           False                  0.0   \n",
       "                                          True                   0.0   \n",
       "                1000        1             False                  0.0   \n",
       "                                          True                   0.0   \n",
       "                            100           False                  0.0   \n",
       "                                          True                   0.0   \n",
       "\n",
       "                                                      matrix_nonzero  \\\n",
       "dictionary_size corpus_size nonzero_limit normalized                   \n",
       "1000            100         1             False                  0.0   \n",
       "                                          True                   0.0   \n",
       "                            100           False                  0.0   \n",
       "                                          True                   0.0   \n",
       "                1000        1             False                  0.0   \n",
       "                                          True                   0.0   \n",
       "                            100           False                  0.0   \n",
       "                                          True                   0.0   \n",
       "100000          100         1             False                  0.0   \n",
       "                                          True                   0.0   \n",
       "                            100           False                  0.0   \n",
       "                                          True                   0.0   \n",
       "                1000        1             False                  0.0   \n",
       "                                          True                   0.0   \n",
       "                            100           False                  0.0   \n",
       "                                          True                   0.0   \n",
       "\n",
       "                                                                    speed  \n",
       "dictionary_size corpus_size nonzero_limit normalized                       \n",
       "1000            100         1             False       0.13 Kdoc pairs / s  \n",
       "                                          True        0.14 Kdoc pairs / s  \n",
       "                            100           False       0.12 Kdoc pairs / s  \n",
       "                                          True        0.08 Kdoc pairs / s  \n",
       "                1000        1             False       0.05 Kdoc pairs / s  \n",
       "                                          True        0.07 Kdoc pairs / s  \n",
       "                            100           False       0.11 Kdoc pairs / s  \n",
       "                                          True        0.06 Kdoc pairs / s  \n",
       "100000          100         1             False       0.07 Kdoc pairs / s  \n",
       "                                          True        0.06 Kdoc pairs / s  \n",
       "                            100           False       0.07 Kdoc pairs / s  \n",
       "                                          True        0.06 Kdoc pairs / s  \n",
       "                1000        1             False       0.03 Kdoc pairs / s  \n",
       "                                          True        0.05 Kdoc pairs / s  \n",
       "                            100           False       0.01 Kdoc pairs / s  \n",
       "                                          True        0.05 Kdoc pairs / s  "
      ]
     },
     "execution_count": 31,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "display(df.apply(lambda x: (x - x.mean()).std())).loc[\n",
    "    [1000, 100000], :, [1, 100], :].loc[\n",
    "    :, [\"duration\", \"corpus_nonzero\", \"matrix_nonzero\", \"speed\"]]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### SCM between a document and a corpus\n",
    "Next, we measure the speed at which the **inner_product** method computes similarities between single documents and a whole corpus."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {},
   "outputs": [],
   "source": [
    "def benchmark(configuration):\n",
    "    (matrix, dictionary, nonzero_limit), corpus, normalized, repetition = configuration\n",
    "    corpus_size = len(corpus)\n",
    "    corpus = [dictionary.doc2bow(doc) for doc in corpus if doc]\n",
    "    \n",
    "    start_time = time()\n",
    "    for vec in corpus:\n",
    "        matrix.inner_product(vec, corpus, normalized=normalized)\n",
    "    end_time = time()\n",
    "    duration = end_time - start_time\n",
    "    \n",
    "    return {\n",
    "        \"dictionary_size\": matrix.matrix.shape[0],\n",
    "        \"matrix_nonzero\": matrix.matrix.nnz,\n",
    "        \"nonzero_limit\": nonzero_limit,\n",
    "        \"normalized\": normalized,\n",
    "        \"corpus_size\": corpus_size,\n",
    "        \"corpus_actual_size\": len(corpus),\n",
    "        \"corpus_nonzero\": sum(len(vec) for vec in corpus),\n",
    "        \"mean_document_length\": np.mean([len(doc) for doc in corpus]),\n",
    "        \"repetition\": repetition,\n",
    "        \"duration\": duration, }"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {},
   "outputs": [],
   "source": [
    "configurations = product(matrices, corpora, normalization, repetitions)\n",
    "results = benchmark_results(benchmark, configurations, \"matrix_speed.inner-product_results.doc_corpus\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
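    "The **speed** decreases as **matrix_nonzero** grows. For the 100000-term dictionary and the 100-document corpus, raising **nonzero_limit** from 1 to 100 increases **matrix_nonzero** from 101868 to 8202884 and lowers the mean speed from 36.05 to 0.28 Kdoc pairs / s. Computing a normalized inner product (**normalized**${}={}$True) adds a further, smaller slowdown, ranging from a few percent to about a quarter in the mean results below."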
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "metadata": {},
   "outputs": [],
   "source": [
    "df = pd.DataFrame(results)\n",
    "df[\"speed\"] = df.corpus_actual_size**2 / df.duration\n",
    "del df[\"corpus_actual_size\"]\n",
    "df = df.groupby([\"dictionary_size\", \"corpus_size\", \"nonzero_limit\", \"normalized\"])\n",
    "\n",
    "def display(df):\n",
    "    df[\"duration\"] = [timedelta(0, duration) for duration in df[\"duration\"]]\n",
    "    df[\"speed\"] = [\"%.02f Kdoc pairs / s\" % (speed / 1000) for speed in df[\"speed\"]]\n",
    "    return df"
   ]
  },
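  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The **speed** column computed above follows from the benchmark loop. In our own shorthand (the symbols $n$ and $t$ are introduced here for illustration only), each of the $n$ documents kept in the corpus (**corpus_actual_size**) is compared against all $n$ documents, so one run covers $n^2$ document pairs in $t$ seconds (**duration**):\n",
    "\n",
    "$$\\mathrm{speed} = \\frac{n^2}{t}.$$\n",
    "\n",
    "Because the tables report the mean of the per-repetition speeds, the reported value generally differs from $n^2$ divided by the mean **duration**."
   ]
  },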
  {
   "cell_type": "code",
   "execution_count": 35,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th>duration</th>\n",
       "      <th>corpus_nonzero</th>\n",
       "      <th>matrix_nonzero</th>\n",
       "      <th>speed</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>dictionary_size</th>\n",
       "      <th>corpus_size</th>\n",
       "      <th>nonzero_limit</th>\n",
       "      <th>normalized</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th rowspan=\"8\" valign=\"top\">1000</th>\n",
       "      <th rowspan=\"4\" valign=\"top\">100</th>\n",
       "      <th rowspan=\"2\" valign=\"top\">1</th>\n",
       "      <th>False</th>\n",
       "      <td>00:00:00.009363</td>\n",
       "      <td>3.0</td>\n",
       "      <td>1000.0</td>\n",
       "      <td>1117.12 Kdoc pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>True</th>\n",
       "      <td>00:00:00.010948</td>\n",
       "      <td>3.0</td>\n",
       "      <td>1000.0</td>\n",
       "      <td>954.13 Kdoc pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th rowspan=\"2\" valign=\"top\">100</th>\n",
       "      <th>False</th>\n",
       "      <td>00:00:00.014128</td>\n",
       "      <td>3.0</td>\n",
       "      <td>84944.0</td>\n",
       "      <td>728.91 Kdoc pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>True</th>\n",
       "      <td>00:00:00.018164</td>\n",
       "      <td>3.0</td>\n",
       "      <td>84944.0</td>\n",
       "      <td>551.78 Kdoc pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th rowspan=\"4\" valign=\"top\">1000</th>\n",
       "      <th rowspan=\"2\" valign=\"top\">1</th>\n",
       "      <th>False</th>\n",
       "      <td>00:00:00.072091</td>\n",
       "      <td>26.0</td>\n",
       "      <td>1000.0</td>\n",
       "      <td>13872.12 Kdoc pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>True</th>\n",
       "      <td>00:00:00.079284</td>\n",
       "      <td>26.0</td>\n",
       "      <td>1000.0</td>\n",
       "      <td>12615.36 Kdoc pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th rowspan=\"2\" valign=\"top\">100</th>\n",
       "      <th>False</th>\n",
       "      <td>00:00:00.162483</td>\n",
       "      <td>26.0</td>\n",
       "      <td>84944.0</td>\n",
       "      <td>6188.43 Kdoc pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>True</th>\n",
       "      <td>00:00:00.203081</td>\n",
       "      <td>26.0</td>\n",
       "      <td>84944.0</td>\n",
       "      <td>4924.48 Kdoc pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th rowspan=\"8\" valign=\"top\">100000</th>\n",
       "      <th rowspan=\"4\" valign=\"top\">100</th>\n",
       "      <th rowspan=\"2\" valign=\"top\">1</th>\n",
       "      <th>False</th>\n",
       "      <td>00:00:00.278253</td>\n",
       "      <td>423.0</td>\n",
       "      <td>101868.0</td>\n",
       "      <td>36.05 Kdoc pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>True</th>\n",
       "      <td>00:00:00.298519</td>\n",
       "      <td>423.0</td>\n",
       "      <td>101868.0</td>\n",
       "      <td>33.56 Kdoc pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th rowspan=\"2\" valign=\"top\">100</th>\n",
       "      <th>False</th>\n",
       "      <td>00:00:36.326167</td>\n",
       "      <td>423.0</td>\n",
       "      <td>8202884.0</td>\n",
       "      <td>0.28 Kdoc pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>True</th>\n",
       "      <td>00:00:36.928802</td>\n",
       "      <td>423.0</td>\n",
       "      <td>8202884.0</td>\n",
       "      <td>0.27 Kdoc pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th rowspan=\"4\" valign=\"top\">1000</th>\n",
       "      <th rowspan=\"2\" valign=\"top\">1</th>\n",
       "      <th>False</th>\n",
       "      <td>00:00:07.403301</td>\n",
       "      <td>5162.0</td>\n",
       "      <td>101868.0</td>\n",
       "      <td>135.08 Kdoc pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>True</th>\n",
       "      <td>00:00:07.794943</td>\n",
       "      <td>5162.0</td>\n",
       "      <td>101868.0</td>\n",
       "      <td>128.29 Kdoc pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th rowspan=\"2\" valign=\"top\">100</th>\n",
       "      <th>False</th>\n",
       "      <td>00:05:55.674712</td>\n",
       "      <td>5162.0</td>\n",
       "      <td>8202884.0</td>\n",
       "      <td>2.81 Kdoc pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>True</th>\n",
       "      <td>00:06:05.561398</td>\n",
       "      <td>5162.0</td>\n",
       "      <td>8202884.0</td>\n",
       "      <td>2.74 Kdoc pairs / s</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                                            duration  \\\n",
       "dictionary_size corpus_size nonzero_limit normalized                   \n",
       "1000            100         1             False      00:00:00.009363   \n",
       "                                          True       00:00:00.010948   \n",
       "                            100           False      00:00:00.014128   \n",
       "                                          True       00:00:00.018164   \n",
       "                1000        1             False      00:00:00.072091   \n",
       "                                          True       00:00:00.079284   \n",
       "                            100           False      00:00:00.162483   \n",
       "                                          True       00:00:00.203081   \n",
       "100000          100         1             False      00:00:00.278253   \n",
       "                                          True       00:00:00.298519   \n",
       "                            100           False      00:00:36.326167   \n",
       "                                          True       00:00:36.928802   \n",
       "                1000        1             False      00:00:07.403301   \n",
       "                                          True       00:00:07.794943   \n",
       "                            100           False      00:05:55.674712   \n",
       "                                          True       00:06:05.561398   \n",
       "\n",
       "                                                      corpus_nonzero  \\\n",
       "dictionary_size corpus_size nonzero_limit normalized                   \n",
       "1000            100         1             False                  3.0   \n",
       "                                          True                   3.0   \n",
       "                            100           False                  3.0   \n",
       "                                          True                   3.0   \n",
       "                1000        1             False                 26.0   \n",
       "                                          True                  26.0   \n",
       "                            100           False                 26.0   \n",
       "                                          True                  26.0   \n",
       "100000          100         1             False                423.0   \n",
       "                                          True                 423.0   \n",
       "                            100           False                423.0   \n",
       "                                          True                 423.0   \n",
       "                1000        1             False               5162.0   \n",
       "                                          True                5162.0   \n",
       "                            100           False               5162.0   \n",
       "                                          True                5162.0   \n",
       "\n",
       "                                                      matrix_nonzero  \\\n",
       "dictionary_size corpus_size nonzero_limit normalized                   \n",
       "1000            100         1             False               1000.0   \n",
       "                                          True                1000.0   \n",
       "                            100           False              84944.0   \n",
       "                                          True               84944.0   \n",
       "                1000        1             False               1000.0   \n",
       "                                          True                1000.0   \n",
       "                            100           False              84944.0   \n",
       "                                          True               84944.0   \n",
       "100000          100         1             False             101868.0   \n",
       "                                          True              101868.0   \n",
       "                            100           False            8202884.0   \n",
       "                                          True             8202884.0   \n",
       "                1000        1             False             101868.0   \n",
       "                                          True              101868.0   \n",
       "                            100           False            8202884.0   \n",
       "                                          True             8202884.0   \n",
       "\n",
       "                                                                        speed  \n",
       "dictionary_size corpus_size nonzero_limit normalized                           \n",
       "1000            100         1             False        1117.12 Kdoc pairs / s  \n",
       "                                          True          954.13 Kdoc pairs / s  \n",
       "                            100           False         728.91 Kdoc pairs / s  \n",
       "                                          True          551.78 Kdoc pairs / s  \n",
       "                1000        1             False       13872.12 Kdoc pairs / s  \n",
       "                                          True        12615.36 Kdoc pairs / s  \n",
       "                            100           False        6188.43 Kdoc pairs / s  \n",
       "                                          True         4924.48 Kdoc pairs / s  \n",
       "100000          100         1             False          36.05 Kdoc pairs / s  \n",
       "                                          True           33.56 Kdoc pairs / s  \n",
       "                            100           False           0.28 Kdoc pairs / s  \n",
       "                                          True            0.27 Kdoc pairs / s  \n",
       "                1000        1             False         135.08 Kdoc pairs / s  \n",
       "                                          True          128.29 Kdoc pairs / s  \n",
       "                            100           False           2.81 Kdoc pairs / s  \n",
       "                                          True            2.74 Kdoc pairs / s  "
      ]
     },
     "execution_count": 35,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "display(df.mean()).loc[\n",
    "    [1000, 100000], :, [1, 100], :].loc[\n",
    "    :, [\"duration\", \"corpus_nonzero\", \"matrix_nonzero\", \"speed\"]]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th>duration</th>\n",
       "      <th>corpus_nonzero</th>\n",
       "      <th>matrix_nonzero</th>\n",
       "      <th>speed</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>dictionary_size</th>\n",
       "      <th>corpus_size</th>\n",
       "      <th>nonzero_limit</th>\n",
       "      <th>normalized</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th rowspan=\"8\" valign=\"top\">1000</th>\n",
       "      <th rowspan=\"4\" valign=\"top\">100</th>\n",
       "      <th rowspan=\"2\" valign=\"top\">1</th>\n",
       "      <th>False</th>\n",
       "      <td>00:00:00.002120</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>242.09 Kdoc pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>True</th>\n",
       "      <td>00:00:00.002387</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>207.64 Kdoc pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th rowspan=\"2\" valign=\"top\">100</th>\n",
       "      <th>False</th>\n",
       "      <td>00:00:00.002531</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>130.94 Kdoc pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>True</th>\n",
       "      <td>00:00:00.000911</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>27.68 Kdoc pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th rowspan=\"4\" valign=\"top\">1000</th>\n",
       "      <th rowspan=\"2\" valign=\"top\">1</th>\n",
       "      <th>False</th>\n",
       "      <td>00:00:00.000587</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>112.92 Kdoc pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>True</th>\n",
       "      <td>00:00:00.001191</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>187.31 Kdoc pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th rowspan=\"2\" valign=\"top\">100</th>\n",
       "      <th>False</th>\n",
       "      <td>00:00:00.011944</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>513.79 Kdoc pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>True</th>\n",
       "      <td>00:00:00.001793</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>43.54 Kdoc pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th rowspan=\"8\" valign=\"top\">100000</th>\n",
       "      <th rowspan=\"4\" valign=\"top\">100</th>\n",
       "      <th rowspan=\"2\" valign=\"top\">1</th>\n",
       "      <th>False</th>\n",
       "      <td>00:00:00.016156</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>2.06 Kdoc pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>True</th>\n",
       "      <td>00:00:00.013451</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.47 Kdoc pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th rowspan=\"2\" valign=\"top\">100</th>\n",
       "      <th>False</th>\n",
       "      <td>00:00:01.339787</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.01 Kdoc pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>True</th>\n",
       "      <td>00:00:01.617340</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.01 Kdoc pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th rowspan=\"4\" valign=\"top\">1000</th>\n",
       "      <th rowspan=\"2\" valign=\"top\">1</th>\n",
       "      <th>False</th>\n",
       "      <td>00:00:00.038961</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.71 Kdoc pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>True</th>\n",
       "      <td>00:00:00.024154</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.40 Kdoc pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th rowspan=\"2\" valign=\"top\">100</th>\n",
       "      <th>False</th>\n",
       "      <td>00:00:07.604805</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.06 Kdoc pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>True</th>\n",
       "      <td>00:00:14.799519</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.10 Kdoc pairs / s</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                                            duration  \\\n",
       "dictionary_size corpus_size nonzero_limit normalized                   \n",
       "1000            100         1             False      00:00:00.002120   \n",
       "                                          True       00:00:00.002387   \n",
       "                            100           False      00:00:00.002531   \n",
       "                                          True       00:00:00.000911   \n",
       "                1000        1             False      00:00:00.000587   \n",
       "                                          True       00:00:00.001191   \n",
       "                            100           False      00:00:00.011944   \n",
       "                                          True       00:00:00.001793   \n",
       "100000          100         1             False      00:00:00.016156   \n",
       "                                          True       00:00:00.013451   \n",
       "                            100           False      00:00:01.339787   \n",
       "                                          True       00:00:01.617340   \n",
       "                1000        1             False      00:00:00.038961   \n",
       "                                          True       00:00:00.024154   \n",
       "                            100           False      00:00:07.604805   \n",
       "                                          True       00:00:14.799519   \n",
       "\n",
       "                                                      corpus_nonzero  \\\n",
       "dictionary_size corpus_size nonzero_limit normalized                   \n",
       "1000            100         1             False                  0.0   \n",
       "                                          True                   0.0   \n",
       "                            100           False                  0.0   \n",
       "                                          True                   0.0   \n",
       "                1000        1             False                  0.0   \n",
       "                                          True                   0.0   \n",
       "                            100           False                  0.0   \n",
       "                                          True                   0.0   \n",
       "100000          100         1             False                  0.0   \n",
       "                                          True                   0.0   \n",
       "                            100           False                  0.0   \n",
       "                                          True                   0.0   \n",
       "                1000        1             False                  0.0   \n",
       "                                          True                   0.0   \n",
       "                            100           False                  0.0   \n",
       "                                          True                   0.0   \n",
       "\n",
       "                                                      matrix_nonzero  \\\n",
       "dictionary_size corpus_size nonzero_limit normalized                   \n",
       "1000            100         1             False                  0.0   \n",
       "                                          True                   0.0   \n",
       "                            100           False                  0.0   \n",
       "                                          True                   0.0   \n",
       "                1000        1             False                  0.0   \n",
       "                                          True                   0.0   \n",
       "                            100           False                  0.0   \n",
       "                                          True                   0.0   \n",
       "100000          100         1             False                  0.0   \n",
       "                                          True                   0.0   \n",
       "                            100           False                  0.0   \n",
       "                                          True                   0.0   \n",
       "                1000        1             False                  0.0   \n",
       "                                          True                   0.0   \n",
       "                            100           False                  0.0   \n",
       "                                          True                   0.0   \n",
       "\n",
       "                                                                      speed  \n",
       "dictionary_size corpus_size nonzero_limit normalized                         \n",
       "1000            100         1             False       242.09 Kdoc pairs / s  \n",
       "                                          True        207.64 Kdoc pairs / s  \n",
       "                            100           False       130.94 Kdoc pairs / s  \n",
       "                                          True         27.68 Kdoc pairs / s  \n",
       "                1000        1             False       112.92 Kdoc pairs / s  \n",
       "                                          True        187.31 Kdoc pairs / s  \n",
       "                            100           False       513.79 Kdoc pairs / s  \n",
       "                                          True         43.54 Kdoc pairs / s  \n",
       "100000          100         1             False         2.06 Kdoc pairs / s  \n",
       "                                          True          1.47 Kdoc pairs / s  \n",
       "                            100           False         0.01 Kdoc pairs / s  \n",
       "                                          True          0.01 Kdoc pairs / s  \n",
       "                1000        1             False         0.71 Kdoc pairs / s  \n",
       "                                          True          0.40 Kdoc pairs / s  \n",
       "                            100           False         0.06 Kdoc pairs / s  \n",
       "                                          True          0.10 Kdoc pairs / s  "
      ]
     },
     "execution_count": 36,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "display(df.apply(lambda x: (x - x.mean()).std())).loc[\n",
    "    [1000, 100000], :, [1, 100], :].loc[\n",
    "    :, [\"duration\", \"corpus_nonzero\", \"matrix_nonzero\", \"speed\"]]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### SCM between two corpora\n",
    "Lastly, we measure the speed at which the **inner_product** method computes (optionally normalized) inner products between two entire corpora in a single call."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 37,
   "metadata": {},
   "outputs": [],
   "source": [
    "def benchmark(configuration):\n",
    "    (matrix, dictionary, nonzero_limit), corpus, normalized, repetition = configuration\n",
    "    corpus_size = len(corpus)\n",
    "    corpus = [dictionary.doc2bow(doc) for doc in corpus]\n",
    "    corpus = [vec for vec in corpus if len(vec) > 0]\n",
    "    \n",
    "    start_time = time()\n",
    "    matrix.inner_product(corpus, corpus, normalized=normalized)\n",
    "    end_time = time()\n",
    "    duration = end_time - start_time\n",
    "    \n",
    "    return {\n",
    "        \"dictionary_size\": matrix.matrix.shape[0],\n",
    "        \"matrix_nonzero\": matrix.matrix.nnz,\n",
    "        \"nonzero_limit\": nonzero_limit,\n",
    "        \"normalized\": normalized,\n",
    "        \"corpus_size\": corpus_size,\n",
    "        \"corpus_actual_size\": len(corpus),\n",
    "        \"corpus_nonzero\": sum(len(vec) for vec in corpus),\n",
    "        \"mean_document_length\": np.mean([len(doc) for doc in corpus]),\n",
    "        \"repetition\": repetition,\n",
    "        \"duration\": duration, }"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 38,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "84e1344be5d944fa98368e6b3994944a",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "HBox(children=(IntProgress(value=0, max=2), HTML(value='')))"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/mnt/storage/home/novotny/.virtualenvs/gensim/lib/python3.4/site-packages/gensim/matutils.py:738: FutureWarning: Conversion of the second argument of issubdtype from `int` to `np.signedinteger` is deprecated. In future, it will be treated as `np.int64 == np.dtype(int).type`.\n",
      "  if np.issubdtype(vec.dtype, np.int):\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n"
     ]
    }
   ],
   "source": [
    "nonzero_limits = [1000]\n",
    "dense_matrices = []\n",
    "for (model, dictionary), nonzero_limit in tqdm(\n",
    "        list(product(zip(models, dictionaries), nonzero_limits)), desc=\"matrices\"):\n",
    "    annoy = AnnoyIndexer(model, 1)\n",
    "    index = WordEmbeddingSimilarityIndex(model, kwargs={\"indexer\": annoy})\n",
    "    matrix = SparseTermSimilarityMatrix(index, dictionary, nonzero_limit=nonzero_limit)\n",
    "    dense_matrices.append((matrix, dictionary, nonzero_limit))\n",
    "    del annoy"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 39,
   "metadata": {},
   "outputs": [],
   "source": [
    "configurations = product(matrices + dense_matrices, corpora + [full_corpus], normalization, repetitions)\n",
    "results = benchmark_results(benchmark, configurations, \"matrix_speed.inner-product_results.corpus_corpus\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 40,
   "metadata": {},
   "outputs": [],
   "source": [
    "df = pd.DataFrame(results)\n",
    "df[\"speed\"] = df.corpus_actual_size**2 / df.duration\n",
    "del df[\"corpus_actual_size\"]\n",
    "df = df.groupby([\"dictionary_size\", \"corpus_size\", \"nonzero_limit\", \"normalized\"])\n",
    "\n",
    "def display(df):\n",
    "    df[\"duration\"] = [timedelta(0, duration) for duration in df[\"duration\"]]\n",
    "    df[\"speed\"] = [\"%.02f Kdoc pairs / s\" % (speed / 1000) for speed in df[\"speed\"]]\n",
    "    return df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 41,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th>duration</th>\n",
       "      <th>corpus_nonzero</th>\n",
       "      <th>matrix_nonzero</th>\n",
       "      <th>speed</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>dictionary_size</th>\n",
       "      <th>corpus_size</th>\n",
       "      <th>nonzero_limit</th>\n",
       "      <th>normalized</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th rowspan=\"24\" valign=\"top\">1000</th>\n",
       "      <th rowspan=\"8\" valign=\"top\">100</th>\n",
       "      <th rowspan=\"2\" valign=\"top\">1</th>\n",
       "      <th>False</th>\n",
       "      <td>00:00:00.001403</td>\n",
       "      <td>3.0</td>\n",
       "      <td>1000.0</td>\n",
       "      <td>6.69 Kdoc pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>True</th>\n",
       "      <td>00:00:00.005313</td>\n",
       "      <td>3.0</td>\n",
       "      <td>1000.0</td>\n",
       "      <td>1.70 Kdoc pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th rowspan=\"2\" valign=\"top\">10</th>\n",
       "      <th>False</th>\n",
       "      <td>00:00:00.001565</td>\n",
       "      <td>3.0</td>\n",
       "      <td>8634.0</td>\n",
       "      <td>5.80 Kdoc pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>True</th>\n",
       "      <td>00:00:00.005307</td>\n",
       "      <td>3.0</td>\n",
       "      <td>8634.0</td>\n",
       "      <td>1.70 Kdoc pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th rowspan=\"2\" valign=\"top\">100</th>\n",
       "      <th>False</th>\n",
       "      <td>00:00:00.003172</td>\n",
       "      <td>3.0</td>\n",
       "      <td>84944.0</td>\n",
       "      <td>3.05 Kdoc pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>True</th>\n",
       "      <td>00:00:00.008461</td>\n",
       "      <td>3.0</td>\n",
       "      <td>84944.0</td>\n",
       "      <td>1.07 Kdoc pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th rowspan=\"2\" valign=\"top\">1000</th>\n",
       "      <th>False</th>\n",
       "      <td>00:00:00.021377</td>\n",
       "      <td>3.0</td>\n",
       "      <td>838588.0</td>\n",
       "      <td>0.42 Kdoc pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>True</th>\n",
       "      <td>00:00:00.055234</td>\n",
       "      <td>3.0</td>\n",
       "      <td>838588.0</td>\n",
       "      <td>0.16 Kdoc pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th rowspan=\"8\" valign=\"top\">1000</th>\n",
       "      <th rowspan=\"2\" valign=\"top\">1</th>\n",
       "      <th>False</th>\n",
       "      <td>00:00:00.001376</td>\n",
       "      <td>26.0</td>\n",
       "      <td>1000.0</td>\n",
       "      <td>418.61 Kdoc pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>True</th>\n",
       "      <td>00:00:00.005019</td>\n",
       "      <td>26.0</td>\n",
       "      <td>1000.0</td>\n",
       "      <td>114.78 Kdoc pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th rowspan=\"2\" valign=\"top\">10</th>\n",
       "      <th>False</th>\n",
       "      <td>00:00:00.001511</td>\n",
       "      <td>26.0</td>\n",
       "      <td>8634.0</td>\n",
       "      <td>381.50 Kdoc pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>True</th>\n",
       "      <td>00:00:00.005208</td>\n",
       "      <td>26.0</td>\n",
       "      <td>8634.0</td>\n",
       "      <td>110.60 Kdoc pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th rowspan=\"2\" valign=\"top\">100</th>\n",
       "      <th>False</th>\n",
       "      <td>00:00:00.003539</td>\n",
       "      <td>26.0</td>\n",
       "      <td>84944.0</td>\n",
       "      <td>164.03 Kdoc pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>True</th>\n",
       "      <td>00:00:00.008502</td>\n",
       "      <td>26.0</td>\n",
       "      <td>84944.0</td>\n",
       "      <td>67.81 Kdoc pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th rowspan=\"2\" valign=\"top\">1000</th>\n",
       "      <th>False</th>\n",
       "      <td>00:00:00.021548</td>\n",
       "      <td>26.0</td>\n",
       "      <td>838588.0</td>\n",
       "      <td>26.73 Kdoc pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>True</th>\n",
       "      <td>00:00:00.054425</td>\n",
       "      <td>26.0</td>\n",
       "      <td>838588.0</td>\n",
       "      <td>10.59 Kdoc pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th rowspan=\"8\" valign=\"top\">100000</th>\n",
       "      <th rowspan=\"2\" valign=\"top\">1</th>\n",
       "      <th>False</th>\n",
       "      <td>00:00:00.019915</td>\n",
       "      <td>2914.0</td>\n",
       "      <td>1000.0</td>\n",
       "      <td>391443.20 Kdoc pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>True</th>\n",
       "      <td>00:00:00.026118</td>\n",
       "      <td>2914.0</td>\n",
       "      <td>1000.0</td>\n",
       "      <td>298377.75 Kdoc pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th rowspan=\"2\" valign=\"top\">10</th>\n",
       "      <th>False</th>\n",
       "      <td>00:00:00.020152</td>\n",
       "      <td>2914.0</td>\n",
       "      <td>8634.0</td>\n",
       "      <td>386722.55 Kdoc pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>True</th>\n",
       "      <td>00:00:00.026998</td>\n",
       "      <td>2914.0</td>\n",
       "      <td>8634.0</td>\n",
       "      <td>288567.14 Kdoc pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th rowspan=\"2\" valign=\"top\">100</th>\n",
       "      <th>False</th>\n",
       "      <td>00:00:00.028345</td>\n",
       "      <td>2914.0</td>\n",
       "      <td>84944.0</td>\n",
       "      <td>274905.36 Kdoc pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>True</th>\n",
       "      <td>00:00:00.041069</td>\n",
       "      <td>2914.0</td>\n",
       "      <td>84944.0</td>\n",
       "      <td>189709.57 Kdoc pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th rowspan=\"2\" valign=\"top\">1000</th>\n",
       "      <th>False</th>\n",
       "      <td>00:00:00.089978</td>\n",
       "      <td>2914.0</td>\n",
       "      <td>838588.0</td>\n",
       "      <td>86598.15 Kdoc pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>True</th>\n",
       "      <td>00:00:00.185611</td>\n",
       "      <td>2914.0</td>\n",
       "      <td>838588.0</td>\n",
       "      <td>41971.58 Kdoc pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th rowspan=\"24\" valign=\"top\">100000</th>\n",
       "      <th rowspan=\"8\" valign=\"top\">100</th>\n",
       "      <th rowspan=\"2\" valign=\"top\">1</th>\n",
       "      <th>False</th>\n",
       "      <td>00:00:00.003345</td>\n",
       "      <td>423.0</td>\n",
       "      <td>101868.0</td>\n",
       "      <td>2013.92 Kdoc pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>True</th>\n",
       "      <td>00:00:00.008857</td>\n",
       "      <td>423.0</td>\n",
       "      <td>101868.0</td>\n",
       "      <td>760.13 Kdoc pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th rowspan=\"2\" valign=\"top\">10</th>\n",
       "      <th>False</th>\n",
       "      <td>00:00:00.032639</td>\n",
       "      <td>423.0</td>\n",
       "      <td>814154.0</td>\n",
       "      <td>206.66 Kdoc pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>True</th>\n",
       "      <td>00:00:00.080591</td>\n",
       "      <td>423.0</td>\n",
       "      <td>814154.0</td>\n",
       "      <td>83.46 Kdoc pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th rowspan=\"2\" valign=\"top\">100</th>\n",
       "      <th>False</th>\n",
       "      <td>00:00:00.488467</td>\n",
       "      <td>423.0</td>\n",
       "      <td>8202884.0</td>\n",
       "      <td>13.77 Kdoc pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>True</th>\n",
       "      <td>00:00:01.454507</td>\n",
       "      <td>423.0</td>\n",
       "      <td>8202884.0</td>\n",
       "      <td>4.62 Kdoc pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th rowspan=\"2\" valign=\"top\">1000</th>\n",
       "      <th>False</th>\n",
       "      <td>00:00:04.973667</td>\n",
       "      <td>423.0</td>\n",
       "      <td>89912542.0</td>\n",
       "      <td>1.35 Kdoc pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>True</th>\n",
       "      <td>00:00:15.035711</td>\n",
       "      <td>423.0</td>\n",
       "      <td>89912542.0</td>\n",
       "      <td>0.45 Kdoc pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th rowspan=\"8\" valign=\"top\">1000</th>\n",
       "      <th rowspan=\"2\" valign=\"top\">1</th>\n",
       "      <th>False</th>\n",
       "      <td>00:00:00.010141</td>\n",
       "      <td>5162.0</td>\n",
       "      <td>101868.0</td>\n",
       "      <td>67139.73 Kdoc pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>True</th>\n",
       "      <td>00:00:00.016685</td>\n",
       "      <td>5162.0</td>\n",
       "      <td>101868.0</td>\n",
       "      <td>40798.02 Kdoc pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th rowspan=\"2\" valign=\"top\">10</th>\n",
       "      <th>False</th>\n",
       "      <td>00:00:00.041392</td>\n",
       "      <td>5162.0</td>\n",
       "      <td>814154.0</td>\n",
       "      <td>16444.18 Kdoc pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>True</th>\n",
       "      <td>00:00:00.091686</td>\n",
       "      <td>5162.0</td>\n",
       "      <td>814154.0</td>\n",
       "      <td>7425.08 Kdoc pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th rowspan=\"2\" valign=\"top\">100</th>\n",
       "      <th>False</th>\n",
       "      <td>00:00:00.508916</td>\n",
       "      <td>5162.0</td>\n",
       "      <td>8202884.0</td>\n",
       "      <td>1338.94 Kdoc pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>True</th>\n",
       "      <td>00:00:01.497556</td>\n",
       "      <td>5162.0</td>\n",
       "      <td>8202884.0</td>\n",
       "      <td>454.49 Kdoc pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th rowspan=\"2\" valign=\"top\">1000</th>\n",
       "      <th>False</th>\n",
       "      <td>00:00:05.101489</td>\n",
       "      <td>5162.0</td>\n",
       "      <td>89912542.0</td>\n",
       "      <td>133.44 Kdoc pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>True</th>\n",
       "      <td>00:00:15.325415</td>\n",
       "      <td>5162.0</td>\n",
       "      <td>89912542.0</td>\n",
       "      <td>44.42 Kdoc pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th rowspan=\"8\" valign=\"top\">100000</th>\n",
       "      <th rowspan=\"2\" valign=\"top\">1</th>\n",
       "      <th>False</th>\n",
       "      <td>00:00:37.145526</td>\n",
       "      <td>525310.0</td>\n",
       "      <td>101868.0</td>\n",
       "      <td>192578.80 Kdoc pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>True</th>\n",
       "      <td>00:00:45.729004</td>\n",
       "      <td>525310.0</td>\n",
       "      <td>101868.0</td>\n",
       "      <td>156431.36 Kdoc pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th rowspan=\"2\" valign=\"top\">10</th>\n",
       "      <th>False</th>\n",
       "      <td>00:00:44.981806</td>\n",
       "      <td>525310.0</td>\n",
       "      <td>814154.0</td>\n",
       "      <td>159029.88 Kdoc pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>True</th>\n",
       "      <td>00:00:54.245450</td>\n",
       "      <td>525310.0</td>\n",
       "      <td>814154.0</td>\n",
       "      <td>131871.88 Kdoc pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th rowspan=\"2\" valign=\"top\">100</th>\n",
       "      <th>False</th>\n",
       "      <td>00:01:15.925860</td>\n",
       "      <td>525310.0</td>\n",
       "      <td>8202884.0</td>\n",
       "      <td>94216.21 Kdoc pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>True</th>\n",
       "      <td>00:01:29.232076</td>\n",
       "      <td>525310.0</td>\n",
       "      <td>8202884.0</td>\n",
       "      <td>80177.08 Kdoc pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th rowspan=\"2\" valign=\"top\">1000</th>\n",
       "      <th>False</th>\n",
       "      <td>00:03:17.140191</td>\n",
       "      <td>525310.0</td>\n",
       "      <td>89912542.0</td>\n",
       "      <td>36286.25 Kdoc pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>True</th>\n",
       "      <td>00:04:05.865666</td>\n",
       "      <td>525310.0</td>\n",
       "      <td>89912542.0</td>\n",
       "      <td>29097.14 Kdoc pairs / s</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                                            duration  \\\n",
       "dictionary_size corpus_size nonzero_limit normalized                   \n",
       "1000            100         1             False      00:00:00.001403   \n",
       "                                          True       00:00:00.005313   \n",
       "                            10            False      00:00:00.001565   \n",
       "                                          True       00:00:00.005307   \n",
       "                            100           False      00:00:00.003172   \n",
       "                                          True       00:00:00.008461   \n",
       "                            1000          False      00:00:00.021377   \n",
       "                                          True       00:00:00.055234   \n",
       "                1000        1             False      00:00:00.001376   \n",
       "                                          True       00:00:00.005019   \n",
       "                            10            False      00:00:00.001511   \n",
       "                                          True       00:00:00.005208   \n",
       "                            100           False      00:00:00.003539   \n",
       "                                          True       00:00:00.008502   \n",
       "                            1000          False      00:00:00.021548   \n",
       "                                          True       00:00:00.054425   \n",
       "                100000      1             False      00:00:00.019915   \n",
       "                                          True       00:00:00.026118   \n",
       "                            10            False      00:00:00.020152   \n",
       "                                          True       00:00:00.026998   \n",
       "                            100           False      00:00:00.028345   \n",
       "                                          True       00:00:00.041069   \n",
       "                            1000          False      00:00:00.089978   \n",
       "                                          True       00:00:00.185611   \n",
       "100000          100         1             False      00:00:00.003345   \n",
       "                                          True       00:00:00.008857   \n",
       "                            10            False      00:00:00.032639   \n",
       "                                          True       00:00:00.080591   \n",
       "                            100           False      00:00:00.488467   \n",
       "                                          True       00:00:01.454507   \n",
       "                            1000          False      00:00:04.973667   \n",
       "                                          True       00:00:15.035711   \n",
       "                1000        1             False      00:00:00.010141   \n",
       "                                          True       00:00:00.016685   \n",
       "                            10            False      00:00:00.041392   \n",
       "                                          True       00:00:00.091686   \n",
       "                            100           False      00:00:00.508916   \n",
       "                                          True       00:00:01.497556   \n",
       "                            1000          False      00:00:05.101489   \n",
       "                                          True       00:00:15.325415   \n",
       "                100000      1             False      00:00:37.145526   \n",
       "                                          True       00:00:45.729004   \n",
       "                            10            False      00:00:44.981806   \n",
       "                                          True       00:00:54.245450   \n",
       "                            100           False      00:01:15.925860   \n",
       "                                          True       00:01:29.232076   \n",
       "                            1000          False      00:03:17.140191   \n",
       "                                          True       00:04:05.865666   \n",
       "\n",
       "                                                      corpus_nonzero  \\\n",
       "dictionary_size corpus_size nonzero_limit normalized                   \n",
       "1000            100         1             False                  3.0   \n",
       "                                          True                   3.0   \n",
       "                            10            False                  3.0   \n",
       "                                          True                   3.0   \n",
       "                            100           False                  3.0   \n",
       "                                          True                   3.0   \n",
       "                            1000          False                  3.0   \n",
       "                                          True                   3.0   \n",
       "                1000        1             False                 26.0   \n",
       "                                          True                  26.0   \n",
       "                            10            False                 26.0   \n",
       "                                          True                  26.0   \n",
       "                            100           False                 26.0   \n",
       "                                          True                  26.0   \n",
       "                            1000          False                 26.0   \n",
       "                                          True                  26.0   \n",
       "                100000      1             False               2914.0   \n",
       "                                          True                2914.0   \n",
       "                            10            False               2914.0   \n",
       "                                          True                2914.0   \n",
       "                            100           False               2914.0   \n",
       "                                          True                2914.0   \n",
       "                            1000          False               2914.0   \n",
       "                                          True                2914.0   \n",
       "100000          100         1             False                423.0   \n",
       "                                          True                 423.0   \n",
       "                            10            False                423.0   \n",
       "                                          True                 423.0   \n",
       "                            100           False                423.0   \n",
       "                                          True                 423.0   \n",
       "                            1000          False                423.0   \n",
       "                                          True                 423.0   \n",
       "                1000        1             False               5162.0   \n",
       "                                          True                5162.0   \n",
       "                            10            False               5162.0   \n",
       "                                          True                5162.0   \n",
       "                            100           False               5162.0   \n",
       "                                          True                5162.0   \n",
       "                            1000          False               5162.0   \n",
       "                                          True                5162.0   \n",
       "                100000      1             False             525310.0   \n",
       "                                          True              525310.0   \n",
       "                            10            False             525310.0   \n",
       "                                          True              525310.0   \n",
       "                            100           False             525310.0   \n",
       "                                          True              525310.0   \n",
       "                            1000          False             525310.0   \n",
       "                                          True              525310.0   \n",
       "\n",
       "                                                      matrix_nonzero  \\\n",
       "dictionary_size corpus_size nonzero_limit normalized                   \n",
       "1000            100         1             False               1000.0   \n",
       "                                          True                1000.0   \n",
       "                            10            False               8634.0   \n",
       "                                          True                8634.0   \n",
       "                            100           False              84944.0   \n",
       "                                          True               84944.0   \n",
       "                            1000          False             838588.0   \n",
       "                                          True              838588.0   \n",
       "                1000        1             False               1000.0   \n",
       "                                          True                1000.0   \n",
       "                            10            False               8634.0   \n",
       "                                          True                8634.0   \n",
       "                            100           False              84944.0   \n",
       "                                          True               84944.0   \n",
       "                            1000          False             838588.0   \n",
       "                                          True              838588.0   \n",
       "                100000      1             False               1000.0   \n",
       "                                          True                1000.0   \n",
       "                            10            False               8634.0   \n",
       "                                          True                8634.0   \n",
       "                            100           False              84944.0   \n",
       "                                          True               84944.0   \n",
       "                            1000          False             838588.0   \n",
       "                                          True              838588.0   \n",
       "100000          100         1             False             101868.0   \n",
       "                                          True              101868.0   \n",
       "                            10            False             814154.0   \n",
       "                                          True              814154.0   \n",
       "                            100           False            8202884.0   \n",
       "                                          True             8202884.0   \n",
       "                            1000          False           89912542.0   \n",
       "                                          True            89912542.0   \n",
       "                1000        1             False             101868.0   \n",
       "                                          True              101868.0   \n",
       "                            10            False             814154.0   \n",
       "                                          True              814154.0   \n",
       "                            100           False            8202884.0   \n",
       "                                          True             8202884.0   \n",
       "                            1000          False           89912542.0   \n",
       "                                          True            89912542.0   \n",
       "                100000      1             False             101868.0   \n",
       "                                          True              101868.0   \n",
       "                            10            False             814154.0   \n",
       "                                          True              814154.0   \n",
       "                            100           False            8202884.0   \n",
       "                                          True             8202884.0   \n",
       "                            1000          False           89912542.0   \n",
       "                                          True            89912542.0   \n",
       "\n",
       "                                                                         speed  \n",
       "dictionary_size corpus_size nonzero_limit normalized                            \n",
       "1000            100         1             False            6.69 Kdoc pairs / s  \n",
       "                                          True             1.70 Kdoc pairs / s  \n",
       "                            10            False            5.80 Kdoc pairs / s  \n",
       "                                          True             1.70 Kdoc pairs / s  \n",
       "                            100           False            3.05 Kdoc pairs / s  \n",
       "                                          True             1.07 Kdoc pairs / s  \n",
       "                            1000          False            0.42 Kdoc pairs / s  \n",
       "                                          True             0.16 Kdoc pairs / s  \n",
       "                1000        1             False          418.61 Kdoc pairs / s  \n",
       "                                          True           114.78 Kdoc pairs / s  \n",
       "                            10            False          381.50 Kdoc pairs / s  \n",
       "                                          True           110.60 Kdoc pairs / s  \n",
       "                            100           False          164.03 Kdoc pairs / s  \n",
       "                                          True            67.81 Kdoc pairs / s  \n",
       "                            1000          False           26.73 Kdoc pairs / s  \n",
       "                                          True            10.59 Kdoc pairs / s  \n",
       "                100000      1             False       391443.20 Kdoc pairs / s  \n",
       "                                          True        298377.75 Kdoc pairs / s  \n",
       "                            10            False       386722.55 Kdoc pairs / s  \n",
       "                                          True        288567.14 Kdoc pairs / s  \n",
       "                            100           False       274905.36 Kdoc pairs / s  \n",
       "                                          True        189709.57 Kdoc pairs / s  \n",
       "                            1000          False        86598.15 Kdoc pairs / s  \n",
       "                                          True         41971.58 Kdoc pairs / s  \n",
       "100000          100         1             False         2013.92 Kdoc pairs / s  \n",
       "                                          True           760.13 Kdoc pairs / s  \n",
       "                            10            False          206.66 Kdoc pairs / s  \n",
       "                                          True            83.46 Kdoc pairs / s  \n",
       "                            100           False           13.77 Kdoc pairs / s  \n",
       "                                          True             4.62 Kdoc pairs / s  \n",
       "                            1000          False            1.35 Kdoc pairs / s  \n",
       "                                          True             0.45 Kdoc pairs / s  \n",
       "                1000        1             False        67139.73 Kdoc pairs / s  \n",
       "                                          True         40798.02 Kdoc pairs / s  \n",
       "                            10            False        16444.18 Kdoc pairs / s  \n",
       "                                          True          7425.08 Kdoc pairs / s  \n",
       "                            100           False         1338.94 Kdoc pairs / s  \n",
       "                                          True           454.49 Kdoc pairs / s  \n",
       "                            1000          False          133.44 Kdoc pairs / s  \n",
       "                                          True            44.42 Kdoc pairs / s  \n",
       "                100000      1             False       192578.80 Kdoc pairs / s  \n",
       "                                          True        156431.36 Kdoc pairs / s  \n",
       "                            10            False       159029.88 Kdoc pairs / s  \n",
       "                                          True        131871.88 Kdoc pairs / s  \n",
       "                            100           False        94216.21 Kdoc pairs / s  \n",
       "                                          True         80177.08 Kdoc pairs / s  \n",
       "                            1000          False        36286.25 Kdoc pairs / s  \n",
       "                                          True         29097.14 Kdoc pairs / s  "
      ]
     },
     "execution_count": 41,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "display(df.mean()).loc[\n",
    "    [1000, 100000], :, [1, 10, 100, 1000], :].loc[\n",
    "    :, [\"duration\", \"corpus_nonzero\", \"matrix_nonzero\", \"speed\"]]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 42,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th>duration</th>\n",
       "      <th>corpus_nonzero</th>\n",
       "      <th>matrix_nonzero</th>\n",
       "      <th>speed</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>dictionary_size</th>\n",
       "      <th>corpus_size</th>\n",
       "      <th>nonzero_limit</th>\n",
       "      <th>normalized</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th rowspan=\"12\" valign=\"top\">1000</th>\n",
       "      <th rowspan=\"4\" valign=\"top\">100</th>\n",
       "      <th rowspan=\"2\" valign=\"top\">1</th>\n",
       "      <th>False</th>\n",
       "      <td>00:00:00.000292</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.48 Kdoc pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>True</th>\n",
       "      <td>00:00:00.000225</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.08 Kdoc pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th rowspan=\"2\" valign=\"top\">100</th>\n",
       "      <th>False</th>\n",
       "      <td>00:00:00.000747</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.02 Kdoc pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>True</th>\n",
       "      <td>00:00:00.000488</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.07 Kdoc pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th rowspan=\"4\" valign=\"top\">1000</th>\n",
       "      <th rowspan=\"2\" valign=\"top\">1</th>\n",
       "      <th>False</th>\n",
       "      <td>00:00:00.000027</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>8.10 Kdoc pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>True</th>\n",
       "      <td>00:00:00.000069</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.56 Kdoc pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th rowspan=\"2\" valign=\"top\">100</th>\n",
       "      <th>False</th>\n",
       "      <td>00:00:00.000309</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>16.26 Kdoc pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>True</th>\n",
       "      <td>00:00:00.000268</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>2.24 Kdoc pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th rowspan=\"4\" valign=\"top\">100000</th>\n",
       "      <th rowspan=\"2\" valign=\"top\">1</th>\n",
       "      <th>False</th>\n",
       "      <td>00:00:00.000576</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>11256.03 Kdoc pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>True</th>\n",
       "      <td>00:00:00.000574</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>6512.19 Kdoc pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th rowspan=\"2\" valign=\"top\">100</th>\n",
       "      <th>False</th>\n",
       "      <td>00:00:00.000562</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>5233.50 Kdoc pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>True</th>\n",
       "      <td>00:00:00.000609</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>2743.63 Kdoc pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th rowspan=\"12\" valign=\"top\">100000</th>\n",
       "      <th rowspan=\"4\" valign=\"top\">100</th>\n",
       "      <th rowspan=\"2\" valign=\"top\">1</th>\n",
       "      <th>False</th>\n",
       "      <td>00:00:00.000152</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>98.97 Kdoc pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>True</th>\n",
       "      <td>00:00:00.000322</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>28.10 Kdoc pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th rowspan=\"2\" valign=\"top\">100</th>\n",
       "      <th>False</th>\n",
       "      <td>00:00:00.004997</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.14 Kdoc pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>True</th>\n",
       "      <td>00:00:00.022206</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.07 Kdoc pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th rowspan=\"4\" valign=\"top\">1000</th>\n",
       "      <th rowspan=\"2\" valign=\"top\">1</th>\n",
       "      <th>False</th>\n",
       "      <td>00:00:00.000210</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1420.00 Kdoc pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>True</th>\n",
       "      <td>00:00:00.000192</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>467.23 Kdoc pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th rowspan=\"2\" valign=\"top\">100</th>\n",
       "      <th>False</th>\n",
       "      <td>00:00:00.019022</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>45.91 Kdoc pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>True</th>\n",
       "      <td>00:00:00.004431</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.35 Kdoc pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th rowspan=\"4\" valign=\"top\">100000</th>\n",
       "      <th rowspan=\"2\" valign=\"top\">1</th>\n",
       "      <th>False</th>\n",
       "      <td>00:00:00.024466</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>126.77 Kdoc pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>True</th>\n",
       "      <td>00:00:00.062447</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>213.64 Kdoc pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th rowspan=\"2\" valign=\"top\">100</th>\n",
       "      <th>False</th>\n",
       "      <td>00:00:00.087692</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>108.55 Kdoc pairs / s</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>True</th>\n",
       "      <td>00:00:01.065889</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>968.80 Kdoc pairs / s</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                                            duration  \\\n",
       "dictionary_size corpus_size nonzero_limit normalized                   \n",
       "1000            100         1             False      00:00:00.000292   \n",
       "                                          True       00:00:00.000225   \n",
       "                            100           False      00:00:00.000747   \n",
       "                                          True       00:00:00.000488   \n",
       "                1000        1             False      00:00:00.000027   \n",
       "                                          True       00:00:00.000069   \n",
       "                            100           False      00:00:00.000309   \n",
       "                                          True       00:00:00.000268   \n",
       "                100000      1             False      00:00:00.000576   \n",
       "                                          True       00:00:00.000574   \n",
       "                            100           False      00:00:00.000562   \n",
       "                                          True       00:00:00.000609   \n",
       "100000          100         1             False      00:00:00.000152   \n",
       "                                          True       00:00:00.000322   \n",
       "                            100           False      00:00:00.004997   \n",
       "                                          True       00:00:00.022206   \n",
       "                1000        1             False      00:00:00.000210   \n",
       "                                          True       00:00:00.000192   \n",
       "                            100           False      00:00:00.019022   \n",
       "                                          True       00:00:00.004431   \n",
       "                100000      1             False      00:00:00.024466   \n",
       "                                          True       00:00:00.062447   \n",
       "                            100           False      00:00:00.087692   \n",
       "                                          True       00:00:01.065889   \n",
       "\n",
       "                                                      corpus_nonzero  \\\n",
       "dictionary_size corpus_size nonzero_limit normalized                   \n",
       "1000            100         1             False                  0.0   \n",
       "                                          True                   0.0   \n",
       "                            100           False                  0.0   \n",
       "                                          True                   0.0   \n",
       "                1000        1             False                  0.0   \n",
       "                                          True                   0.0   \n",
       "                            100           False                  0.0   \n",
       "                                          True                   0.0   \n",
       "                100000      1             False                  0.0   \n",
       "                                          True                   0.0   \n",
       "                            100           False                  0.0   \n",
       "                                          True                   0.0   \n",
       "100000          100         1             False                  0.0   \n",
       "                                          True                   0.0   \n",
       "                            100           False                  0.0   \n",
       "                                          True                   0.0   \n",
       "                1000        1             False                  0.0   \n",
       "                                          True                   0.0   \n",
       "                            100           False                  0.0   \n",
       "                                          True                   0.0   \n",
       "                100000      1             False                  0.0   \n",
       "                                          True                   0.0   \n",
       "                            100           False                  0.0   \n",
       "                                          True                   0.0   \n",
       "\n",
       "                                                      matrix_nonzero  \\\n",
       "dictionary_size corpus_size nonzero_limit normalized                   \n",
       "1000            100         1             False                  0.0   \n",
       "                                          True                   0.0   \n",
       "                            100           False                  0.0   \n",
       "                                          True                   0.0   \n",
       "                1000        1             False                  0.0   \n",
       "                                          True                   0.0   \n",
       "                            100           False                  0.0   \n",
       "                                          True                   0.0   \n",
       "                100000      1             False                  0.0   \n",
       "                                          True                   0.0   \n",
       "                            100           False                  0.0   \n",
       "                                          True                   0.0   \n",
       "100000          100         1             False                  0.0   \n",
       "                                          True                   0.0   \n",
       "                            100           False                  0.0   \n",
       "                                          True                   0.0   \n",
       "                1000        1             False                  0.0   \n",
       "                                          True                   0.0   \n",
       "                            100           False                  0.0   \n",
       "                                          True                   0.0   \n",
       "                100000      1             False                  0.0   \n",
       "                                          True                   0.0   \n",
       "                            100           False                  0.0   \n",
       "                                          True                   0.0   \n",
       "\n",
       "                                                                        speed  \n",
       "dictionary_size corpus_size nonzero_limit normalized                           \n",
       "1000            100         1             False           1.48 Kdoc pairs / s  \n",
       "                                          True            0.08 Kdoc pairs / s  \n",
       "                            100           False           1.02 Kdoc pairs / s  \n",
       "                                          True            0.07 Kdoc pairs / s  \n",
       "                1000        1             False           8.10 Kdoc pairs / s  \n",
       "                                          True            1.56 Kdoc pairs / s  \n",
       "                            100           False          16.26 Kdoc pairs / s  \n",
       "                                          True            2.24 Kdoc pairs / s  \n",
       "                100000      1             False       11256.03 Kdoc pairs / s  \n",
       "                                          True         6512.19 Kdoc pairs / s  \n",
       "                            100           False        5233.50 Kdoc pairs / s  \n",
       "                                          True         2743.63 Kdoc pairs / s  \n",
       "100000          100         1             False          98.97 Kdoc pairs / s  \n",
       "                                          True           28.10 Kdoc pairs / s  \n",
       "                            100           False           0.14 Kdoc pairs / s  \n",
       "                                          True            0.07 Kdoc pairs / s  \n",
       "                1000        1             False        1420.00 Kdoc pairs / s  \n",
       "                                          True          467.23 Kdoc pairs / s  \n",
       "                            100           False          45.91 Kdoc pairs / s  \n",
       "                                          True            1.35 Kdoc pairs / s  \n",
       "                100000      1             False         126.77 Kdoc pairs / s  \n",
       "                                          True          213.64 Kdoc pairs / s  \n",
       "                            100           False         108.55 Kdoc pairs / s  \n",
       "                                          True          968.80 Kdoc pairs / s  "
      ]
     },
     "execution_count": 42,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "display(df.apply(lambda x: x.std())).loc[\n",
    "    [1000, 100000], :, [1, 100], :].loc[\n",
    "    :, [\"duration\", \"corpus_nonzero\", \"matrix_nonzero\", \"speed\"]]"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.4.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
